Power and performance tradeoffs for data cache prefetching in low-power embedded systems by Reungsang, Pipat
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 
1-1-2001 
Power and performance tradeoffs for data cache prefetching in 
low-power embedded systems 
Pipat Reungsang 
Iowa State University 
Recommended Citation 
Reungsang, Pipat, "Power and performance tradeoffs for data cache prefetching in low-power embedded 
systems" (2001). Retrospective Theses and Dissertations. 21487. 
https://lib.dr.iastate.edu/rtd/21487 
Power and performance tradeoffs for data cache prefetching
in low-power embedded systems
by
Pipat Reungsang
A thesis submitted to the graduate faculty 
in partial fulfillment of the requirements for the degree of 
MASTER OF SCIENCE 
Major: Computer Engineering 
Major Professor: Gyungho Lee 





Iowa State University 
This is to certify that the Master's thesis of 
Pipat Reungsang 
has met the thesis requirements of Iowa State University 
Signatures have been redacted for privacy 
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
2 RELATED WORK
2.1 Power evaluation for embedded systems
2.1.1 Wattch: an architectural-level power analysis tool
2.2 Cache prefetching
3 POWER AND PERFORMANCE TRADEOFFS USING VARIOUS CACHE CONFIGURATIONS
3.1 Experimental methodology
3.2 Benchmark programs
3.3 Results and discussions
4 POWER AND PERFORMANCE TRADEOFFS USING DATA CACHE PREFETCHING
4.1 Cache pollution from prefetched data
4.2 Fixed Prefetch Block for prefetched data
4.3 Experimental methodology
4.4 Results and discussions
5 CONCLUSIONS
REFERENCES
ACKNOWLEDGMENTS
LIST OF FIGURES
Figure 1: The overall structure of Wattch simulator
Figure 2: Cache pollution in a four-set associative cache
Figure 3: Fixed Prefetch Block scheme
Figure 4: Four-way set associative cache with Fixed Prefetch Block
Figure 5: Percent miss rate reduction for 512 bytes data cache size
Figure 6: Energy-Delay product for 512 bytes data cache size
Figure 7: Percent miss rate reduction for 256 bytes data cache size
Figure 8: Percent miss rate reduction for 2 Kbytes data cache size
Figure 9: Energy-Delay product for 256 bytes data cache size
Figure 10: Energy-Delay product for 2 Kbytes data cache size
LIST OF TABLES
Table 1: Mediabench benchmarks description
Table 2: Energy-Delay product for 128 bytes caches relative to base machine
Table 3: Energy-Delay product for 256 bytes caches relative to base machine
Table 4: Energy-Delay product for 512 bytes caches relative to base machine
Table 5: Energy-Delay product for 1 Kbytes caches relative to base machine
Table 6: Energy-Delay product for 2 Kbytes caches relative to base machine
Table 7: Average power, performance, ED, and miss rate for different cache size (with 2-way set associative and 16 bytes line size) relative to the base machine
Table 8: Percentage of pollution from prefetched data and hit on prefetched block for 512 bytes cache size
Table 9: Percentage of pollution from prefetched data and hit on prefetched block for 256 bytes cache size
Table 10: Percentage of pollution from prefetched data and hit on prefetched block for 2 Kbytes cache size
1 INTRODUCTION 
As portable communications and multimedia devices become more commonplace, 
low power embedded processors should be designed to meet the needs of these applications. 
Designers of modern microprocessors have often focused on high speed rather than low
power dissipation. However, in portable devices, low-power system design has become an
important issue. Several techniques have been proposed to reduce the energy of various 
system components. These include both hardware and software oriented techniques. Many of 
these efforts [11, 13, 25] have focused on the memory subsystem which has been found to be 
a major energy consumer of the entire system. For example, on-chip caches of the DEC 
21164 microprocessor consume 25% of the total chip power [11], and on-chip caches of 
modern embedded RISC microprocessors such as the StrongARM SA-110 [17] and the
PowerPC from IBM [2] consume the largest or second-largest share of total chip power.
Clearly, the cache is one of the major targets for power reduction. Decreasing cache size 
leads to reducing power consumption because a smaller cache has less capacitance from the 
bit array size as well as smaller drivers in decoder or peripheral circuitry [26]. Unfortunately, 
performance also decreases due to a lower cache hit rate when the cache size becomes 
smaller. For example, a direct-mapped 256-byte filter cache achieves a 58% power reduction
while reducing performance by 21%, corresponding to a 51% reduction in the energy-delay
product over a conventional design [13]. To reduce power with a small cache while retaining
performance, there is a need to explore techniques for increasing the hit rate.
Prefetching has been shown to be an effective way to improve hit rate without 
affecting hit access time of the cache. The idea of prefetching is to predict data access needs 
in advance so that specific data is loaded from the main memory into the cache before it is 
actually needed. In this way, both cache misses and the overall execution time are reduced. 
Many techniques for instruction and data cache prefetching have been proposed. These 
techniques include both hardware- and software-oriented approaches [14]. Although many
software techniques [4, 18, 21, 22] clearly have a cost advantage over hardware techniques,
they impose additional overhead on the applications: extra time must be spent executing
the prefetch instructions. Furthermore, software prefetching techniques must be optimized
for a given memory architecture and implementation. Hardware techniques, in contrast, have
access to dynamic information and can recognize memory access patterns, such as unexpected
cache conflicts, that are difficult for the compiler to predict. While previous research has
focused on instruction prefetching between cache and memory, recent studies have 
investigated how to prefetch for data cache [5, 16]. However, data reference patterns show 
less spatial locality than instruction references. This makes it difficult to predict future data 
reference patterns. Thus if we prefetch a large amount of unnecessary data into cache, the 
cache may get polluted, resulting in a waste of cache space and unnecessary consumption of 
data bandwidth. With multimedia applications, the cache pollution seems worse because of 
the "stream" nature of multimedia data. A split cache organization, which separates 
prefetched data from regular-fetched data, may be utilized to prevent cache pollution [27]. 
This technique of having a separate cache called a stream buffer is an add-on to an existing 
cache to handle multimedia applications. Obviously, the advantage of this technique is that 
the prefetched data cannot create cache pollution. In addition, prefetching can significantly 
improve the cache hit rate due to the high spatial locality of multimedia data. However, this
add-on separate cache requires additional hardware components, resulting in increased power
consumption. In an embedded system, it is not sufficient to consider only the cache hit rate
improvement from prefetching, because while prefetching reduces the miss rate, it also
increases power consumption. Thus, it is important to study the tradeoffs
between performance and power consumption of prefetching in embedded systems.
While several studies have examined prefetching strategies for improving hit rate, no
published work has studied the power consumption of hardware prefetching schemes. This
thesis studies the tradeoffs between performance and power consumption when prefetching is
implemented. To show these tradeoffs, simulation results from various benchmark programs,
obtained with a cycle-by-cycle simulator, are presented and compared. In addition, to
eliminate the power consumption of an add-on separate cache while still reducing cache
pollution, a new technique called "Fixed Prefetch Block" (FPB) is proposed. To show the
effectiveness of the FPB technique, simulation results are also analyzed and compared with
those of other prefetching schemes.
The rest of this thesis is organized as follows. Chapter 2 summarizes previous
related work on power evaluation for embedded systems and data cache prefetching
techniques. Chapter 3 studies the tradeoffs between power and performance using various
cache configurations, while Chapter 4 studies the tradeoffs between power and performance
using cache prefetching. Conclusions and suggestions for future work are summarized in
Chapter 5.
2 RELATED WORK 
The related previous work can be divided into two major topics: power evaluation for 
embedded systems and cache prefetching techniques. 
2.1 Power evaluation for embedded systems 
Since a major source of power dissipation in an embedded system is cache memory, 
prior research has been aimed at measuring and recommending optimal cache configuration 
for power. There exist several energy models for caches. Kamble and Ghose [12] have 
developed an analytic model for power consumption in various cache schemes. Their model 
combines memory traffic, process features such as capacitance, and architectural factors 
including cache line size, set associativity, and capacity. The process models are based on
measurements reported by Wilton and Jouppi [26] for a 0.8 µm process technology. Su and
Despain [24] have developed an intuitive and simple model for power consumption due to 
hits in a cache. They evaluated the effectiveness of a number of low power cache structures. 
Block buffering involves latching the last cache line to be accessed and doing an early match 
against the buffer for each reference. Sub-banking involves breaking a large array into 
independent banks and only powering a cache bank if it matches certain address bits. Hicks, 
Walnock and Owens [10] extended this model and considered the energy consumption due to 
cache misses as well. While these models have mainly focused on a cache model for power 
consumption, there has been prior work on architecture-level power estimation tools. For 
example, Chen et al. [5] have developed a system-level power analysis tool for a combined,
single-chip 16-bit DSP and 32-bit RISC microprocessor. In this model, capacitance data was
generated from switch-level simulation of functional unit designs. Brooks et al. [3] have
developed "Wattch", an architectural simulator that estimates CPU power consumption. Their
power estimates are based on a suite of parameterizable power models for different
hardware structures and on per-cycle resource usage counts generated through cycle-level
simulation. In this thesis we use Wattch as the power analysis tool for evaluating our new
design. The details of the Wattch simulator are explained below.
2.1.1 Wattch: an architectural-level power analysis tool
Wattch is an architectural simulator that estimates CPU power consumption. The
foundation of its power modeling infrastructure is a set of parameterized power
models of common structures present in modern superscalar microprocessors. These power
models are integrated into the SimpleScalar architectural simulator [1]. Figure 1 shows the
overall structure of Wattch and the interface between the performance simulator and the
power models.
Power consumption in CMOS has three components: dynamic power consumption, 
short circuit current power consumption, and static power consumption. With proper circuit 
design techniques, the latter two components can be reduced and are negligible compared to 
the dynamic power consumption [24]. The power model for the main processor can be 
divided into four categories: 
• Array structures: data and instruction caches, cache tag arrays, all register files,
register alias table, branch predictors, and large portions of the instruction window
and load/store queue.
• Fully associative Content-Addressable Memories (CAMs): instruction
window/reorder buffer wakeup logic, load/store order checks, and TLBs, for
example.
• Combinational logic and wires: functional units, instruction window selection
logic, dependency check logic, and result buses.
• Clocking: clock buffers, clock wires, and capacitive loads.









[Diagram labels: binary, cycle-by-cycle performance simulator, hardware access counts, estimates]
Figure 1: The overall structure of Wattch simulator. 
In CMOS microprocessors, dynamic power consumption (Pd) is the main source of 
power consumption, and is defined as 
\[ P_d = C \, V_{dd}^{2} \, a \, f \]
where
C is the load capacitance,
Vdd is the supply voltage,
f is the clock frequency, and
a is the activity factor, a fraction between 0 and 1 indicating how often, on average,
clock ticks lead to switching activity.
Vdd and f depend on the assumed process technology; 0.35 µm is assumed in the
Wattch simulator [20]. The activity factor depends on the benchmark programs being
executed. For circuits that pre-charge and discharge on every cycle, an a of 1 is used. The
activity factors for certain critical sub-circuits are measured from the benchmark programs
using the architectural simulator. For sub-circuits whose activity factors cannot be measured
with the simulator, a base activity factor of 0.5 (random switching
activity) is assumed. The Wattch simulator estimates C from the internal capacitances of the
circuits that make up the processor, using assumptions similar to those made by
Wilton and Jouppi [26].
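
As an illustration of the dynamic power expression above, the following minimal Python sketch evaluates it for a single circuit. The function name and the capacitance, voltage, and frequency values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from Wattch or from the experiments in this thesis.

```python
def dynamic_power(c_load_farads, vdd_volts, activity, freq_hz):
    """Return dynamic power in watts using P_d = C * Vdd^2 * a * f."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity factor must lie between 0 and 1")
    return c_load_farads * vdd_volts ** 2 * activity * freq_hz


# Hypothetical values for illustration only (not from Wattch or the thesis).
C_LOAD = 2e-12   # 2 pF of switched capacitance
VDD = 2.5        # supply voltage in volts
FREQ = 200e6     # 200 MHz clock

# A circuit that pre-charges and discharges every cycle uses a = 1.
precharged = dynamic_power(C_LOAD, VDD, 1.0, FREQ)
# A circuit with no measured activity uses the base factor a = 0.5.
default = dynamic_power(C_LOAD, VDD, 0.5, FREQ)

print(f"a = 1.0: {precharged * 1e3:.3f} mW")
print(f"a = 0.5: {default * 1e3:.3f} mW")
```

Because the expression is linear in a, halving the activity factor halves the estimated power, whereas a change in Vdd has a quadratic effect.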
2.2 Cache prefetching 
During the past few years, considerable research effort has been spent on data prefetching, and a
number of different prefetching techniques have been proposed. Prefetching techniques can
be classified into two types based on the characteristics of the prefetch candidates that a
technique targets: non-selective prefetching and highly selective prefetching [4]. Typical
non-selective prefetching techniques include always-prefetching and prefetch-on-miss [9,
23]. With always-prefetching, every time there is a reference to block i, the cache is examined
for block i+1 (the next sequential block). If block i+1 is absent from the cache, it is
prefetched. Unlike always-prefetching, the prefetch-on-miss technique issues a
prefetch only if the current reference results in a cache miss. One-block-lookahead (OBL)
[23] is an example of this technique. When a cache miss happens, instead of bringing only the
requested block (block i) from main memory into the cache, it also brings the next
block (block i+1) into the cache. The stream buffer [11] was introduced later as an extension of the
OBL idea. With a stream buffer, blocks i+1, i+2, ..., i+n are brought into the stream buffer
when a cache miss happens. Thus, this approach places less demand on the cache and requires
fewer memory cycles than always-prefetching. However, neither always-prefetching nor
prefetch-on-miss predicts accurately enough to give a substantial
improvement in cache performance.
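
The difference between the two non-selective policies can be sketched with a toy simulation. The Python code below is a minimal sketch under simplifying assumptions that are not part of the cited work: the cache is fully associative with LRU replacement, a prefetched block is inserted as the most recently used block, and the trace is a list of block numbers. The function name `simulate` and the policy labels are hypothetical.

```python
from collections import OrderedDict


def simulate(block_trace, capacity, policy):
    """Count demand misses for a fully associative LRU cache of `capacity` blocks.

    policy is "none", "always" (always-prefetching), or "on_miss"
    (prefetch-on-miss, as in one-block-lookahead).
    """
    cache = OrderedDict()  # block number -> None, ordered from LRU to MRU

    def touch(block):
        # Mark the block as most recently used, evicting the LRU block if needed.
        if block in cache:
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[block] = None

    misses = 0
    for block in block_trace:
        hit = block in cache
        if not hit:
            misses += 1
        touch(block)
        # Decide whether this reference triggers a prefetch of block i+1.
        if policy == "always" or (policy == "on_miss" and not hit):
            if block + 1 not in cache:
                touch(block + 1)
    return misses


trace = list(range(8))  # purely sequential references to blocks 0..7
for policy in ("none", "always", "on_miss"):
    print(policy, simulate(trace, capacity=4, policy=policy))
```

On this purely sequential trace, always-prefetching leaves only the first reference as a miss, while prefetch-on-miss halves the number of misses. Real data reference streams are far less regular, which is where the accuracy problem noted above appears.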
In contrast, highly selective prefetching techniques have been proposed to improve 
accuracy of prefetching by limiting the number of unnecessary prefetched data into a cache. 
Palacharla and Kessler [19] developed a filter scheme and a method for allowing a
variable-length stride to enhance the stream buffer. Fu and Patel [8] developed a prefetching
technique, which uses the Stride Prediction Table (SPT) to calculate the stride distance of 
data references for an instruction. The SPT is a cache containing data addresses previously 
referenced. When a memory accessing instruction is issued, the SPT is searched to find the 
previously referenced data address. If such a data address is found, the stride of the data 
reference for this instruction is then calculated by subtracting the previous referenced data 
address from the data address referenced by the current instruction. Finally, a data prefetch 
whose address is the sum of the current address and the stride distance is issued. Chen and
Baer [4] proposed a similar structure called the Reference Prediction Table (RPT). This
scheme connects future data references with previously executed memory-referencing 
instructions. The stride distance of data references is also determined by calculating the 
difference between the data address of the previous reference and the data address of the 
current reference. In addition, their technique includes state bits to maintain the characteristic 
of each operation. Although most of the highly selective prefetching techniques do a good
job of predicting which data to prefetch and work well for large cache sizes, they do not
perform well with smaller caches because they prefetch a large amount of unnecessary
data, which pollutes the cache. To further improve the performance of the highly selective
prefetching techniques, there is a need to reduce cache pollution for small cache sizes.
To select a suitable and effective cache configuration, it is important to
know the tradeoffs between power and performance for cache configurations before we
incorporate prefetching. In the next chapter, we evaluate the power and performance of
different cache designs.
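
The stride calculation described earlier in this section for the Stride Prediction Table can be sketched as follows. This is a minimal Python sketch, not the implementation of Fu and Patel: the table is modelled as an unbounded dictionary keyed by instruction address (a real SPT is a small cache with a limited number of entries), zero strides are skipped as an added assumption, and the function name is hypothetical.

```python
def spt_prefetch_addresses(accesses):
    """Yield (pc, prefetch_address) pairs for a stream of (pc, data_address) accesses.

    Assumption: the table is an unbounded dict; a hardware SPT has few entries.
    """
    table = {}  # instruction address -> previously referenced data address
    for pc, addr in accesses:
        prev = table.get(pc)
        if prev is not None:
            # Stride = current data address minus the previous one for this instruction.
            stride = addr - prev
            # Assumption: a zero stride would prefetch the block just accessed, so skip it.
            if stride != 0:
                yield pc, addr + stride
        table[pc] = addr


# Hypothetical load at instruction address 0x400 walking an array in 16-byte steps.
stream = [(0x400, 100), (0x400, 116), (0x400, 132)]
print(list(spt_prefetch_addresses(stream)))
```

The first access only fills the table; from the second access on, the sketch issues a prefetch one stride ahead of the current address. The state bits of the Reference Prediction Table are not modelled here.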
3 POWER AND PERFORMANCE TRADEOFFS USING VARIOUS CACHE 
CONFIGURATIONS 
In this chapter, we investigate how each cache configuration affects power and
performance by using a detailed, architectural-level simulator and a set of benchmark programs.
To compare the effectiveness of each configuration in embedded processors, power consumption
must be evaluated along with performance. We use the product of energy
and delay as the metric for our evaluation and refer to it as "ED". Because this thesis
considers only the data cache, the ED discussed here is the ED for the data
cache. Based on the results obtained from simulations, we determine the ED of each
cache configuration. We then select the cache configuration that has the minimum
ED to serve as the reference cache design for the discussion in Chapter 4.
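
As a small illustration of the metric, the Python sketch below computes an ED value relative to a base machine from an energy figure and a cycle count. The function name and the numbers are hypothetical; they are not simulation results from this thesis.

```python
def relative_ed(energy, cycles, base_energy, base_cycles):
    """Return the energy-delay product of a configuration relative to a base machine."""
    return (energy * cycles) / (base_energy * base_cycles)


# Hypothetical example: a configuration that uses 40% of the base energy
# but needs 25% more cycles than the base machine.
ed = relative_ed(energy=4.0, cycles=125, base_energy=10.0, base_cycles=100)
print(ed)
```

A relative ED below 1 means the configuration offers a better energy-delay tradeoff than the base machine, even when its execution time is longer.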
3.1 Experimental methodology 
We compare a set of different cache configurations to a
traditional processor and cache, which we call the base machine. The base machine model
used here is an embedded system designed to be roughly comparable to the Intel
StrongARM SA-110 in terms of system resources [17]: 16 KB instruction and data caches
(direct mapped, 32-byte line size), a single-issue processor, and no L2 cache. Applications were
compiled and executed using Wattch, an extended version of the SimpleScalar toolset. This
approach allows us to experiment with various cache structures and generate accurate clock
cycle counts for execution time. These counts are used directly as the delay in the ED metric.
The EDs were determined for the base machine model executing each application in the
experimental workload and subsequently used to evaluate alternative cache configurations.
As mentioned earlier, Wattch estimates C based on the internal capacitances of
particular circuits. Because we are interested only in data cache power consumption, we
discuss in detail only how Wattch determines cache power dissipation. A cache is classified as
one of the array structures in Wattch. For array structures, Wattch models the power
consumption of the following stages: decoder, word-line drive, bit-line discharge, and output
drive (sense amplifier). Thus, for a cache, Wattch first estimates the capacitance of each stage
and then uses these capacitances to determine power consumption. Finally, the power
consumption of all stages is summed. To estimate the
capacitance of each stage, the array structure power model is parameterized by the
number of rows (entries), the number of columns (width of each entry), and the number of read/write ports.
These parameters affect the size and number of decoders, the number of word-lines, and the
number of bit-lines. In addition, these parameters are used for estimating the length of the
pre-decode wires as well as the lengths of the array's word-lines and bit-lines. Note that in this
thesis we compute the number of rows and columns with the help of the Cacti tool [26].
Cacti is a tool developed to determine delay-optimal cache configurations given cache
parameters such as cache size, block size, and associativity. To obtain minimum
delay, Cacti keeps the arrays as square as possible by configuring the tag array independently of
the data array. This is because the delay, or access time, of a cache is directly
proportional to the capacitance driven and to the length of the bit-lines or word-lines.
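
To make the row and column parameters concrete, the Python sketch below derives a simple logical geometry from the cache size, line size, and associativity. It is only a sketch under stated assumptions: one row per set with all ways side by side, 32-bit addresses, and no status bits. It does not reproduce the array partitioning that Cacti performs to keep the arrays nearly square, and the function names are hypothetical.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def cache_geometry(size_bytes, line_bytes, assoc, addr_bits=32):
    """Return a simple logical array geometry for a set associative cache.

    Assumptions: one row per set, all ways of a set in the same row,
    addr_bits-wide addresses, and no valid or dirty bits.
    """
    if size_bytes % (line_bytes * assoc) != 0:
        raise ValueError("size must be a multiple of line size times associativity")
    sets = size_bytes // (line_bytes * assoc)
    if not (is_power_of_two(sets) and is_power_of_two(line_bytes)):
        raise ValueError("sets and line size must be powers of two")
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return {
        "rows": sets,
        "data_columns": 8 * line_bytes * assoc,  # data bits per row
        "tag_bits": addr_bits - index_bits - offset_bits,
    }


# The 512-byte, 2-way, 16-byte-line data cache and the 16 KB direct-mapped base cache.
print(cache_geometry(512, 16, 2))
print(cache_geometry(16 * 1024, 32, 1))
```

Under these assumptions the small cache has 16 rows against 512 for the base cache, which suggests why its decoder and bit-lines present much less capacitance.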
3.2 Benchmark programs 
The simulation workloads are based on five benchmarks from the Mediabench [15] suite. All
the programs were compiled using a GCC compiler that generates code in the portable ISA
(PISA) format. The Mediabench programs represent the workloads of a
variety of emerging multimedia and communication applications. These applications are
commonly used in personal telecommunications and PDA devices. All of the simulations
were performed on an Intel platform running Red Hat Linux. The descriptions of these
applications are given in Table 1.
Table 1: Mediabench benchmarks description. 
APPLICATION DESCRIPTION 
3.3 Results and discussions 
For our base machine, we assumed a single-level cache structure with split instruction
and data caches, each with a capacity of 16 KB. These caches are direct mapped with a line
size of 32 bytes each. The Stanford cache design tool [7], using 0.35 µm technology
and six-transistor cells, suggests that a simple direct-mapped cache has around a 30%
reduction in access time over a 2- or 4-way associative cache. Thus, for our simulation, we
assume that the hit access time is one clock cycle for a direct-mapped cache and two clock
cycles for a 2- or 4-way associative cache. We considered small cache sizes of 128 bytes, 256
bytes, 512 bytes, 1 Kbyte, and 2 Kbytes to look at the effect on ED of different line sizes
and associativities. For each size, the line size was varied between 8, 16, and 32 bytes. The
instruction cache was simulated with the same configuration as the base machine. The results
for each cache configuration are presented relative to the base machine. The EDs for the
Mediabench benchmarks are shown in Tables 2 to 6 for the 128-byte, 256-byte, 512-byte,
1-Kbyte, and 2-Kbyte cache sizes, respectively.
First we considered the impact of line size. Based on the average values of ED from
Tables 2 to 6, the 16-byte line size performs better than the 8- and 32-byte line sizes in
most cases, with one exception: the smallest cache size with the smallest line size
(128-byte cache, 8-byte lines). Even there, the ED improvement over the 16-byte line size is
small, only about 3% of the base machine's ED
(0.3969 - 0.3647 = 0.0322). Because of this, we consider only caches with a 16-byte line size.
Now we focus on the ED for different associativities with a 16-byte line size. In many
cases, the 2-way set associative cache performs better than the 4-way set associative cache,
while the direct-mapped cache performs worse in most cases. In the cases where the 4-way
set associative cache performs better than the 2-way set associative cache, the ED
improvement is at most 5%. Thus we no longer consider the direct-mapped or
4-way set associative caches. Up to this point, we have investigated the impact of line
size and associativity on the ED. The remaining factor to be explored is cache size. Table 7
highlights the impact of varying the cache size while the line size and associativity are fixed at 16
bytes and 2-way, respectively. The minimum ED occurs when the cache size is
512 bytes. Thus it is reasonable to conclude that a 512-byte, 2-way set associative cache
with a 16-byte line size is the best choice for our experiment. With this cache configuration,
power is reduced by 72% while performance is degraded by 30%. This tradeoff
brings the ED down to 37% of that of the base machine (a relative ED of 0.3710). However,
Table 7 shows that the miss rate increases 10 times
compared with the base machine. Based on this result, it is clear that to improve
performance there is a need to reduce the miss rate. From previous studies, prefetching has
been shown to be an effective way to reduce miss rate. Thus, to reduce the miss rate in our
research, prefetching will be incorporated into the cache configuration selected in this
chapter.
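
The cache size selection can be retraced with a short Python sketch. The per-benchmark values below are the 2-way set associative, 16-byte-line columns of Tables 2 to 6; the averaging and minimum search are an illustration of the selection step, not the tooling used to produce Table 7, and the variable names are hypothetical.

```python
# Relative ED of the 2-way set associative, 16-byte-line data cache, copied from
# Tables 2 to 6 (order: cjpeg, epic, mpeg2decode, mpeg2encode,
# pegwitdecode, pegwitencode). Keys are cache sizes in bytes.
ED_2WAY_16B = {
    128: [0.4984, 0.3496, 0.4699, 0.2722, 0.4353, 0.3560],
    256: [0.4455, 0.3409, 0.4663, 0.2616, 0.4125, 0.3383],
    512: [0.4406, 0.3619, 0.3846, 0.2624, 0.4221, 0.3546],
    1024: [0.4611, 0.4406, 0.3420, 0.2734, 0.4718, 0.3974],
    2048: [0.5658, 0.5761, 0.4375, 0.3502, 0.5525, 0.4797],
}


def average(values):
    return sum(values) / len(values)


averages = {size: average(eds) for size, eds in ED_2WAY_16B.items()}
best_size = min(averages, key=averages.get)

for size, avg in averages.items():
    print(f"{size:5d} bytes: average relative ED = {avg:.4f}")
print("minimum average ED at", best_size, "bytes")
```

The averages fall from 128 to 512 bytes and rise again for 1 Kbyte and 2 Kbytes, consistent with the choice of the 512-byte cache.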
In the next chapter we will analyze various uses of data prefetching techniques to 
improve cache miss rate. In addition, we will compare both power and performance impact 
of using these prefetching techniques. 
Table 2: Energy-Delay product for 128 bytes caches relative to base machine.
cache size = 128 bytes

Line size = 8 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.5094           0.4624                   0.4904
epic           0.3817           0.3290                   0.3590
mpeg2decode    0.4430           0.4086                   0.4390
mpeg2encode    0.3579           0.2880                   0.3030
pegwitdecode   0.3997           0.3790                   0.3991
pegwitencode   0.3365           0.3211                   0.3332
Average        0.4047           0.3647                   0.3873

Line size = 16 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.5482           0.4984                   0.5418
epic           0.3855           0.3496                   0.4242
mpeg2decode    0.5089           0.4699                   0.5315
mpeg2encode    0.3665           0.2722                   0.3002
pegwitdecode   0.4385           0.4353                   0.4825
pegwitencode   0.3513           0.3560                   0.3959
Average        0.4332           0.3969                   0.4460

Line size = 32 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.8394           0.7501                   0.9241
epic           0.4925           0.4348                   0.5606
mpeg2decode    0.6965           0.6583                   0.7509
mpeg2encode    0.4882           0.3197                   0.3659
pegwitdecode   0.6258           0.6327                   0.7314
pegwitencode   0.4743           0.5003                   0.5759
Average        0.6028           0.5493                   0.6515
Table 3: Energy-Delay product for 256 bytes caches relative to base machine.
cache size = 256 bytes

Line size = 8 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.5391           0.4729                   0.4651
epic           0.4075           0.3426                   0.3408
mpeg2decode    0.5134           0.4395                   0.4523
mpeg2encode    0.3841           0.2985                   0.3064
pegwitdecode   0.4399           0.3856                   0.3925
pegwitencode   0.3668           0.3275                   0.3297
Average        0.4418           0.3778                   0.3812

Line size = 16 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.4815           0.4455                   0.4666
epic           0.3672           0.3409                   0.3658
mpeg2decode    0.5268           0.4663                   0.5203
mpeg2encode    0.3389           0.2616                   0.2902
pegwitdecode   0.4301           0.4125                   0.4536
pegwitencode   0.3495           0.3383                   0.3814
Average        0.4157           0.3775                   0.4130

Line size = 32 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.6252           0.5614                   0.5873
epic           0.4223           0.4034                   0.4679
mpeg2decode    0.6699           0.5940                   0.7166
mpeg2encode    0.4000           0.2885                   0.3495
pegwitdecode   0.5469           0.5499                   0.6501
pegwitencode   0.4322           0.4462                   0.5302
Average        0.5161           0.4739                   0.5503
Table 4: Energy-Delay product for 512 bytes caches relative to base machine.
cache size = 512 bytes

Line size = 8 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.5966           0.5245                   0.4761
epic           0.4918           0.4034                   0.3799
mpeg2decode    0.4650           0.4208                   0.4483
mpeg2encode    0.4119           0.3144                   0.3051
pegwitdecode   0.5175           0.4330                   0.4033
pegwitencode   0.4323           0.3583                   0.3327
Average        0.4858           0.4091                   0.3909

Line size = 16 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.4644           0.4406                   0.4399
epic           0.4104           0.3619                   0.3868
mpeg2decode    0.4197           0.3846                   0.4638
mpeg2encode    0.3497           0.2624                   0.2859
pegwitdecode   0.4646           0.4221                   0.4479
pegwitencode   0.3849           0.3546                   0.3749
Average        0.4156           0.3710                   0.3999

Line size = 32 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.4647           0.4757                   0.5122
epic           0.4117           0.3984                   0.4670
mpeg2decode    0.5060           0.4699                   0.6272
mpeg2encode    0.3718           0.2845                   0.3364
pegwitdecode   0.5179           0.5255                   0.6061
pegwitencode   0.4312           0.4328                   0.5019
Average        0.4505           0.4311                   0.5085
Table 5: Energy-Delay product for 1 Kbytes caches relative to base machine.
cache size = 1 Kbytes

Line size = 8 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.5229           0.5698                   0.4661
epic           0.5242           0.5256                   0.4678
mpeg2decode    0.4869           0.4092                   0.3716
mpeg2encode    0.4221           0.3308                   0.2958
pegwitdecode   0.5104           0.5223                   0.4609
pegwitencode   0.4322           0.4448                   0.3917
Average        0.4831           0.4671                   0.4090

Line size = 16 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.4976           0.4611                   0.4200
epic           0.5098           0.4406                   0.4360
mpeg2decode    0.5074           0.3420                   0.3473
mpeg2encode    0.4264           0.2734                   0.2703
pegwitdecode   0.5325           0.4718                   0.4621
pegwitencode   0.4469           0.3974                   0.3845
Average        0.4868           0.3977                   0.3867

Line size = 32 bytes
Application    Direct mapped    2-way set associative    4-way set associative
cjpeg          0.4568           0.4519                   0.4778
epic           0.4687           0.4418                   0.4999
mpeg2decode    0.5637           0.3454                   0.4066
mpeg2encode    0.4280           0.2735                   0.3053
pegwitdecode   0.5470           0.5290                   0.5940
pegwitencode   0.4556           0.4425                   0.5020
Average        0.4867           0.4140                   0.4643
Table 6: Energy-Delay product for 2 Kbytes caches relative to base machine.
cache size = 2 Kbytes
Applications   Direct mapped   2-way set associative   4-way set associative
Line size = 8 bytes
cjpeg          0.7437          0.8108                  0.5749
epic           0.7905          0.8090                  0.6207
mpeg2decode    0.7017          0.6150                  0.4615
mpeg2encode    0.6395          0.4958                  0.3714
pegwitdecode   0.7007          0.7305                  0.5534
pegwitencode   0.6103          0.6311                  0.4862
Average        0.6977          0.6820                  0.5114
Line size = 16 bytes
cjpeg          0.5178          0.5658                  0.4795
epic           0.5663          0.5761                  0.5378
mpeg2decode    0.5296          0.4375                  0.3946
mpeg2encode    0.4786          0.3502                  0.3189
pegwitdecode   0.5348          0.5525                  0.5100
pegwitencode   0.4653          0.4797                  0.4467
Average        0.5154          0.4936                  0.4479
Line size = 32 bytes
cjpeg          0.4482          0.5152                  0.4854
epic           0.4994          0.5391                  0.5673
mpeg2decode    0.5587          0.4084                  0.3937
mpeg2encode    0.4484          0.3215                  0.3185
pegwitdecode   0.5197          0.5668                  0.5912
pegwitencode   0.4405          0.4968                  0.5133
Average        0.4858          0.4746                  0.4782
Table 7: Average power, performance, ED, and miss rate for different cache sizes (with
2-way set associativity and 16 bytes line size) relative to the base machine.
cache size (bytes) performance power ED miss rate 
128 1.4841 0.2649 0.3969 18.8626 
256 1.3997 0.2673 0.3775 15.9402 
512 1.3018 0.2833 0.3710 10.3443 
1024 1.2008 0.3285 0.3977 5.3654 
2048 1.1535 0.4250 0.4936 3.9670 
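The selection of the lowest-ED configuration from Table 7 can be illustrated with a short sketch. The rows are copied from Table 7; the names `TABLE_7` and `min_ed_configuration` are hypothetical and exist only for this illustration.

```python
# Rows copied from Table 7: (cache size in bytes, performance, power, ED, miss rate).
# The variable and function names are illustrative assumptions, not part of the thesis.
TABLE_7 = [
    (128, 1.4841, 0.2649, 0.3969, 18.8626),
    (256, 1.3997, 0.2673, 0.3775, 15.9402),
    (512, 1.3018, 0.2833, 0.3710, 10.3443),
    (1024, 1.2008, 0.3285, 0.3977, 5.3654),
    (2048, 1.1535, 0.4250, 0.4936, 3.9670),
]


def min_ed_configuration(rows):
    """Return the row with the smallest relative energy-delay product (column 3)."""
    return min(rows, key=lambda row: row[3])


best = min_ed_configuration(TABLE_7)
print(f"lowest relative ED = {best[3]:.4f} at a cache size of {best[0]} bytes")
```

Among the listed sizes, the 512 bytes cache has the lowest relative ED, even though larger caches have lower miss rates.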
4 POWER AND PERFORMANCE TRADEOFFS USING DATA CACHE 
PREFETCHING 
In this chapter, we investigate how data cache prefetching affects power and
performance. Prefetching is an attractive technique for improving the cache hit rate because
it does so without affecting the access time of the cache. However, prefetching can degrade
performance by displacing useful data and thus causing new cache misses (cache pollution).
In addition, hardware prefetching requires extra components to be incorporated with the
cache, which increases power consumption. To eliminate cache pollution from prefetching,
prefetched data should not be placed directly in the main cache. This requires additional
storage (called a buffer) to hold the prefetched data, which increases the power consumption
of the system even further. To eliminate the extra power of the additional buffer, we
introduce a new scheme that determines which block of the main cache should be replaced
with prefetched data in order to minimize the cache pollution problem. Our technique focuses
on the problem of interference between the spatial locality of prefetched data and the
temporal locality of demand-fetched data. The technique enhances the effectiveness of
prefetching on small set associative caches. To evaluate the effectiveness of our scheme, its
power and performance are compared against the reference cache from the previous chapter
and against other schemes, which are discussed later in this chapter.
4.1 Cache pollution from prefetched data 
Before we go into the details of our technique, we briefly describe how prefetched data
creates pollution. Consider a four-way set associative cache with the least recently used
(LRU) replacement policy, as in Figure 2a. Whenever a cache miss happens, the LRU block is
the first candidate to be replaced with the newly referenced data transferred from main
memory. If a prefetching technique is implemented to reduce the miss rate, the prefetched
data has to replace one block of the set, as shown in Figure 2b. Since all prefetched data are
placed into the cache, the working set of demand-fetched data in the cache is disturbed by
data prefetching. Especially on a smaller cache with highly selective prefetching, the
prefetched data may occupy all of the blocks (Figure 2c). Eventually, cache pollution may
remove all the demand-fetched data that have high temporal locality.
[Figure: one cache set shown in three states, a) to c); LRU marks the least recently used block and P marks blocks holding prefetched data.]
Figure 2: Cache pollution in a four-way set associative cache.
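The progression from Figure 2a to Figure 2c can be reproduced with a minimal sketch of a single four-way set under plain LRU replacement. The class `LRUSet` and its methods are hypothetical names used only for illustration, and the sketch assumes that a prefetch hit leaves the replacement order unchanged.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (illustrative model, not the simulator's code).

    Prefetched blocks are inserted exactly like demand-fetched blocks,
    so they can evict any block of the set.
    """

    def __init__(self, ways=4):
        self.ways = ways
        # Maps tag -> True if the block was prefetched; the first item is the LRU block.
        self.blocks = OrderedDict()

    def _fill(self, tag, is_prefetched):
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[tag] = is_prefetched

    def demand_access(self, tag):
        """Return True on a hit; on a miss, fetch the block and replace the LRU block."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # mark as most recently used
            return True
        self._fill(tag, False)
        return False

    def prefetch(self, tag):
        """Bring a block in ahead of use; assumed to do nothing on a prefetch hit."""
        if tag not in self.blocks:
            self._fill(tag, True)

    def demand_blocks(self):
        """Tags of the demand-fetched blocks still resident in the set."""
        return [tag for tag, prefetched in self.blocks.items() if not prefetched]


cache_set = LRUSet(ways=4)
for tag in ["A", "B", "C", "D"]:      # Figure 2a: only demand-fetched blocks
    cache_set.demand_access(tag)
for tag in ["P1", "P2"]:              # Figure 2b: prefetched blocks displace some of them
    cache_set.prefetch(tag)
after_two = cache_set.demand_blocks()
for tag in ["P3", "P4"]:              # Figure 2c: prefetched blocks occupy the whole set
    cache_set.prefetch(tag)
after_four = cache_set.demand_blocks()
print(after_two, after_four)
```

After four prefetches without an intervening demand access, no demand-fetched block remains in the set, so the next demand access to any of the original blocks misses.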
4.2 Fixed Prefetch Block for prefetched data
The previous section shows that cache pollution is created by prefetched data. To
reduce this pollution, a "Fixed Prefetch Block" (FPB) scheme is proposed. Its purpose is to
reduce the interference between the prefetched data and the demand-fetched data in the cache.
The highly selective prefetching technique exploits memory access patterns arising from the
spatial locality of data to calculate a stride distance, and this stride is used to predict the
address of the data to prefetch. Our study found that prefetched data has high spatial and low
temporal locality. Even if prefetched data has a chance to be reused, it is assumed that the
data can be prefetched accurately again in time for its future use. For this reason, it is not
necessary to keep prefetched data in the cache for a long period, as it disturbs the
demand-fetched data. If prefetched data exists in the cache, new prefetched data should
replace only the current prefetched data, so that cache pollution from prefetched data is
reduced. This strategy also utilizes the whole cache efficiently, unlike a separate stream
buffer, which wastes cache when [...]




[Figure: two panels, a) Processor accesses data from cache and b) Cache prefetches data from memory.]
Figure 3: Fixed Prefetch Block scheme.
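The stride-based address prediction that drives prefetching in this chapter can be sketched as follows. The text only states that a stride is calculated from the access pattern and used to predict the prefetch address. The table organization below (indexed by the load instruction address, one previous data address per entry, `pc % entries` as the index function) is an assumption made for illustration, as are all the names.

```python
class StridePredictionTable:
    """Minimal stride predictor sketch (assumed organization, not the thesis's exact design)."""

    def __init__(self, entries=8):
        self.entries = entries
        # Each entry is either None or a (pc, last_address) pair.
        self.table = [None] * entries

    def access(self, pc, address):
        """Record a load at `pc` touching `address`; return a prefetch address or None."""
        index = pc % self.entries          # hypothetical index function
        previous = self.table[index]
        self.table[index] = (pc, address)
        if previous is None or previous[0] != pc:
            return None                    # no history for this load yet
        stride = address - previous[1]
        if stride == 0:
            return None                    # same address again, nothing new to prefetch
        return address + stride            # predicted next address for this load


spt = StridePredictionTable(entries=8)
for addr in (100, 116, 132):
    print(addr, "->", spt.access(0x400, addr))
```

A load that walks through memory with a constant stride yields a prediction from its second access onward, while a load with no regular stride (as in the encryption benchmarks discussed later) produces predictions that are unlikely to be used.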
Accessing the cache in our scheme is separated into two cases: normal data access
(Figure 3a) and access for prefetching from main memory (Figure 3b). For a normal data
access, on a cache miss the data from main memory replaces the LRU block of the cache. In
contrast, when the cache prefetches data from main memory, if the data to be prefetched is
already in the cache (prefetch hit), nothing is done. If the data to be prefetched is not in
the cache (prefetch miss), new data is prefetched from main memory into the cache. If
previously prefetched data exists in the set, the new block of prefetched data is placed only
in the block that contains the previously prefetched data. Otherwise, the LRU block in the
set is replaced. Hence, our scheme provides only one block for prefetching per set.
As the description above suggests, the FPB scheme is easy to implement in a set
associative cache. To identify the prefetched block in a cache set, it only requires
attaching P-tag bits to the cache set. Figure 4 shows a block diagram of a four-way set
associative cache integrated with the FPB scheme. The LRU INFO table contains the
replacement order, while the P-INFO bits indicate the status of the prefetched block for
each set. The P-INFO bits consist of P-loc bits and a P-tag bit, which give the location and
the existence of prefetched data, respectively. Initially, the P-loc bits indicate the LRU
block while the P-tag bit is set to "0", meaning that there is no prefetched data in the
cache. When a read operation is to be performed, the processor places the data address on
the address bus. The most significant bits form a tag field while the least significant bits
form an index field. The index field points to a cache line together with the LRU INFO bits,
P-loc bits, and P-tag bit of each cache set. When the read operation is performed, a tag
value is read from each cache block. These four tag values are compared with the tag field of
the address bus. If a match is found, a cache hit has occurred in that block. If the data
requested by the processor is not found in the cache, a cache miss happens. Then, the
processor reads the data from main memory, and this data replaces the LRU block. The LRU
INFO table keeps track of the replacement order, while P-loc bits [...]






[Figure: block diagram of the cache sets with their LRU INFO, P-loc, and P-tag fields.]
Figure 4: Four-way set associative cache with Fixed Prefetch Block.
During prefetching, a prefetch address, rather than a data address from the processor,
is placed on the address bus. If the data to be prefetched is already in the cache, nothing
needs to be updated. In contrast, when the data to be prefetched is not in the cache and the
P-tag bit is "0", the LRU block is replaced with the new data block and the P-tag bit is
changed to "1". Otherwise, the block indicated by the P-loc bits is replaced. If the
prefetched block is replaced with demand-fetched data, P-tag is reset to "0". Hence, our
scheme needs three extra bits, two bits for P-loc and one bit for P-tag, when incorporated
into a four-way set associative cache.
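The replacement rules of the FPB scheme can be summarized in a minimal sketch of one cache set. The class `FPBSet`, the helper `p_info_bits`, and the list-based LRU bookkeeping are hypothetical names and choices made for illustration. The sketch also assumes that filling a prefetched block marks it as most recently used, which the text does not spell out, and it generalizes the three-bit count for a four-way set to other associativities as an assumption.

```python
class FPBSet:
    """One set of a set associative cache with a Fixed Prefetch Block (illustrative sketch)."""

    def __init__(self, ways=4):
        self.tags = [None] * ways            # tag held by each way
        self.lru_order = list(range(ways))   # way indices, least recently used first
        self.p_tag = 0                       # 1 if the set currently holds a prefetched block
        self.p_loc = self.lru_order[0]       # way of the prefetched block (initially the LRU way)

    def _touch(self, way):
        """Mark a way as most recently used."""
        self.lru_order.remove(way)
        self.lru_order.append(way)

    def demand_access(self, tag):
        """Normal access: return True on a hit; on a miss, replace the LRU block."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        victim = self.lru_order[0]
        if self.p_tag == 1 and victim == self.p_loc:
            self.p_tag = 0                   # prefetched block replaced by demand-fetched data
        self.tags[victim] = tag
        self._touch(victim)
        return False

    def prefetch(self, tag):
        """Prefetch access: at most one block per set ever holds prefetched data."""
        if tag in self.tags:
            return                           # prefetch hit: nothing is updated
        if self.p_tag == 0:
            self.p_loc = self.lru_order[0]   # no prefetched block yet: take the LRU block
            self.p_tag = 1
        self.tags[self.p_loc] = tag          # otherwise reuse the block named by P-loc
        self._touch(self.p_loc)              # assumption: a prefetch fill updates the LRU order


def p_info_bits(ways):
    """Extra bits per set: P-loc (enough bits to name a way) plus one P-tag bit."""
    return (ways - 1).bit_length() + 1


fpb_set = FPBSet(ways=4)
for tag in ["A", "B", "C", "D"]:
    fpb_set.demand_access(tag)
for tag in ["P1", "P2", "P3", "P4"]:
    fpb_set.prefetch(tag)
print(fpb_set.tags, "extra bits per set:", p_info_bits(4))
```

In contrast to plain LRU insertion, repeated prefetches overwrite one another in the same way, so three of the four demand-fetched blocks survive the same sequence of four prefetches.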
4.3 Experimental methodology 
Our scheme is used to enhance the effectiveness of prefetching techniques. To
evaluate it, a prefetching technique needs to be implemented and combined with our scheme.
Our study found that the Stride Prediction Table (SPT) is one of the most effective
prefetching techniques. Previously reported results have shown that SPT works very well only
with a large cache; with a smaller cache, SPT prefetches a large amount of unnecessary data
and pollutes the cache. We incorporate our scheme into the SPT prefetching technique to
minimize the pollution due to prefetching. To evaluate the effectiveness of our scheme, we
developed three machine models for data cache prefetching from sim-outorder in the Wattch
simulator, an extension of the SimpleScalar toolset. These models are 1) data cache
prefetching with least recently used replacement (LRU), 2) data cache prefetching with a
stream buffer (STREAM), and 3) our scheme, data cache prefetching with FPB (FPB). The ED
and the miss rate of these models are measured and compared against the reference machine
model. These three models add components to the system, so a power consumption model for
the extra hardware needs to be added to the simulator. The modifications to the power
simulator infrastructure for the new hardware were not complex. All machine models need the
SPT and one adder to calculate the predicted address, and the STREAM machine model
additionally requires extra cache storage for the stream buffer. Note that we neglect the
power consumption of the P-INFO table in our scheme, because this table consumes a very
small amount of power compared with the cache or the SPT. The SPT was modeled as a simple
array structure using the same power model as the other array structures, while the
additional adder uses the same power model as the functional units. Similar to the base
machine model from Chapter 3, the reference machine used here is a single-issue machine with
a 16 KB direct-mapped instruction cache with a 32 bytes line size and no L2 cache, except
that we fixed the data cache at 512 bytes, 2-way set associative, with a 16 bytes line size
for all simulations. This cache was selected in Chapter 3 based on the minimum energy-delay
product. To make a fair comparison between the STREAM model and our scheme, the size of the
stream buffer was varied over 16, 32, 64, 128, and 256 bytes, and the size whose miss rate
was close to that of our scheme was then chosen. All applications were compiled and executed
using the same set as in the previous chapter. An SPT with a fixed 8 entries was used for all
simulations. This number was chosen to minimize the pollution from prefetched data, since
more SPT entries would produce more prefetched data and therefore more cache pollution in a
small cache. To make our comparison easier to understand, a measure of miss rate reduction
is introduced. Miss rate reduction is defined as the percentage by which the miss rate is
reduced compared with the miss rate of the base machine. Throughout this thesis, the miss
rate reductions of the FPB, LRU, and STREAM models are calculated from the equations below:
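\[
\text{Miss rate reduction}_{\text{FPB}}\ (\%) = \frac{\text{Miss rate}_{\text{reference machine}} - \text{Miss rate}_{\text{FPB}}}{\text{Miss rate}_{\text{reference machine}}} \times 100
\]
\[
\text{Miss rate reduction}_{\text{LRU}}\ (\%) = \frac{\text{Miss rate}_{\text{reference machine}} - \text{Miss rate}_{\text{LRU}}}{\text{Miss rate}_{\text{reference machine}}} \times 100
\]
\[
\text{Miss rate reduction}_{\text{STREAM}}\ (\%) = \frac{\text{Miss rate}_{\text{reference machine}} - \text{Miss rate}_{\text{STREAM}}}{\text{Miss rate}_{\text{reference machine}}} \times 100
\]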
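The miss rate reduction defined above is the same calculation for all three models, as the following sketch shows. The function name and the miss rates in the example are hypothetical values chosen for illustration, not simulation results.

```python
def miss_rate_reduction(reference_miss_rate, model_miss_rate):
    """Percentage by which a prefetching model reduces the reference machine's miss rate."""
    return (reference_miss_rate - model_miss_rate) / reference_miss_rate * 100.0


# Hypothetical miss rates (in percent), used only to show the sign convention.
print(round(miss_rate_reduction(10.0, 9.0), 2))   # prefetching helps: positive reduction
print(round(miss_rate_reduction(10.0, 11.0), 2))  # prefetching pollutes: negative reduction
```

A model whose miss rate is higher than that of the reference machine yields a negative value.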
In some cases, if prefetched data creates more pollution than improvement in the
cache, the percent miss rate reduction is negative, because the miss rate of the reference
machine is lower than that of the machine that implements prefetching. Note that in this
thesis, a demand miss has higher priority for accessing main memory than a prefetch miss:
even if a prefetch is accessing main memory, it is aborted by the demand miss. Up to this
point, it is not clear how prefetched data improves the miss rate; the miss rate reduction
only shows the net effect of implementing prefetching. To show how the prefetched data
improves the miss rate, the percentage of cache pollution from prefetched data and the
percentage of cache hits on prefetched blocks are calculated.
The percentage of pollution from prefetching is calculated from the equation below:
\[
\text{Pollution from prefetch}\ (\%) = \frac{\#\,\text{cache hits without prefetching} - \#\,\text{cache hits with prefetching on regular blocks}}{\#\,\text{cache hits without prefetching}} \times 100
\]
The percentage of cache hits on prefetched blocks is calculated from the equation
below:
\[
\text{Hit on prefetch}\ (\%) = \frac{\#\,\text{cache hits with prefetching on prefetched blocks}}{\#\,\text{cache hits without prefetching}} \times 100
\]
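As a worked example of the two percentages, the sketch below applies them to the hit counts reported for cjpeg under FPB in Table 8. The function name is hypothetical, and treating the improvement as the difference between the two percentages is an assumption that matches the values listed in that table.

```python
def pollution_and_prefetch_hits(hits_without_prefetch, hits_on_regular, hits_on_prefetched):
    """Return (pollution %, hit on prefetch %, improvement %) relative to hits without prefetching."""
    pollution = (hits_without_prefetch - hits_on_regular) / hits_without_prefetch * 100.0
    prefetch_hit = hits_on_prefetched / hits_without_prefetch * 100.0
    # Assumed definition of the improvement: hits gained on prefetched blocks minus hits lost to pollution.
    return pollution, prefetch_hit, prefetch_hit - pollution


# Hit counts for cjpeg with FPB and a 512 bytes cache, taken from Table 8.
pollution, prefetch_hit, improvement = pollution_and_prefetch_hits(3856579, 3675336, 208865)
print(f"pollution {pollution:.4f}%, hit on prefetch {prefetch_hit:.4f}%, improvement {improvement:.4f}%")
# Table 8 lists 4.6996, 5.4158, and 0.7162 for this row.
```

Prefetching pays off in terms of hits only when the improvement is positive, that is, when the hits gained on prefetched blocks outnumber the hits lost to pollution.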
4.4 Results and discussions 
In this section, we present the results obtained with the simulation methodology
described above. Each benchmark was simulated and measured for its miss rate,
performance, and power consumption. Figure 5 shows the miss rate reduction of the three
machine models for each of the MediaBench benchmarks, while Figure 6 shows the ED of these
three machine models. As Figure 5 shows, no single prefetching model performs best across
all benchmarks. For the cjpeg benchmark, the LRU model performs best (4.2%), while FPB
performs best for epic (33.1%) and STREAM performs best for mpeg2decode (9.8%) and
mpeg2encode (4.8%). Note that for the pegwitdecode and pegwitencode benchmarks, none of the
prefetching models improved the miss rate at all. In fact, prefetching created more
pollution in the cache for the FPB and STREAM models. This is because pegwitdecode and
pegwitencode are decryption and encryption programs whose memory references are random
accesses. Consequently, the SPT did not capture the access pattern of these programs and
prefetched too much unnecessary data into the cache, which caused cache pollution. Overall
simulation results have shown that the miss rate reduction of the STREAM model is [...]




[Figure: chart of the miss rate reduction per benchmark for the FPB, LRU, and STREAM models.]
Figure 5: Percent miss rate reduction for 512 bytes data cache size.
On the other hand, if we look at the energy-delay product of these three prefetching
models, we see that all of them consume more power than the reference machine model
(Figure 6). The STREAM model in particular consumes more power than the other two
prefetching models; for the epic benchmark it consumes more than twice the power of the
reference machine model. This is because the STREAM model has an additional buffer for
holding prefetched data. On average, LRU provides the minimum ED (60% more than the base
machine) for this data cache size. Note that over the six benchmark programs, the average
miss rate reduction of our scheme is almost the same as that of the STREAM model, while our
scheme consumes less power.
[Figure: chart of the energy-delay product per benchmark (cjpeg, epic, pegwitdecode, pegwitencode, mpeg2decode, mpeg2encode, Average) for the FPB, LRU, and STREAM models.]
Figure 6: Energy-Delay product for 512 bytes data cache size.
Table 8 shows that, on average, our scheme creates less pollution than the LRU
machine, while the STREAM model creates the least pollution. This is because the STREAM
model has the additional buffer to support prefetching, so data in the main cache are not
replaced with prefetched data. Our scheme provides only one block of each cache set for
prefetched data, while the LRU machine model allows more blocks to hold prefetched data, so
that data in the main cache is eventually replaced with prefetched data. However, to benefit
from prefetching, the percentage of pollution must be less than the percentage of cache hits
on prefetched data. As seen in Table 8, our scheme has a better trade-off in this regard.
However, with this particular cache size, the ED of each prefetching scheme is larger than
that of the reference machine model without prefetching. This is because the miss rate
improvement obtained from the prefetching schemes cannot compensate for the extra power
consumed by the prefetching hardware.
Table 8: Percentage of pollution from prefetched data and hit on prefetched block for
512 bytes cache size.
FPB
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          3856579         3675336          208865            4.6996           5.4158         0.7162
epic           5778315         5114734          1248484           11.4840          21.6064        10.1224
pegwitdecode   3698688         3582715          62335             3.1355           1.6853         -1.4502
pegwitencode   6335813         6004366          112259            5.2313           1.7718         -3.4595
mpeg2decode    24728816        21628224         3168961           12.5384          12.8149        0.2765
mpeg2encode    300946701       295828637        5307218           1.7007           1.7635         0.0629
Average        57557485.3333   55972335.3333    1684687.0000      6.4649           7.5096         1.0447
LRU
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          3856579         3650188          244980            5.3517           6.3523         1.0006
epic           5778315         5778998          1565              -0.0118          0.0271         0.0389
pegwitdecode   3698688         3352521          345658            9.3592           9.3454         -0.0138
pegwitencode   6335813         5673175          634464            10.4586          10.0139        -0.4447
mpeg2decode    24728816        20902047         3832945           15.4749          15.4999        0.0250
mpeg2encode    300946701       295444288        6244496           1.8284           2.0750         0.2466
Average        57557485.3333   55800202.8333    1884018.0000      7.0768           7.2189         0.1421
STREAM32
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          3856579         3757591          150670            2.5667           3.9068         1.3401
epic           5778315         5191537          1185057           10.1548          20.5087        10.3539
pegwitdecode   3698688         3588573          110654            2.9771           2.9917         0.0146
pegwitencode   6335813         5956550          250554            5.9860           3.9546         -2.0315
mpeg2decode    24728816        24231399         1667772           2.0115           6.7442         4.7328
mpeg2encode    300946701       297255403        5337064           1.2266           1.7734         0.5469
Average        57557485.3333   56663508.8333    1450295.1667      4.1538           6.6466         2.4928
To further investigate the effectiveness of our scheme, we also simulated it with
cache sizes smaller and larger than 512 bytes, using the same methodology as described
previously. The cache sizes used here are 256 bytes and 2 Kbytes. Figures 7 and 8 show the
miss rate reduction of the three machine models for [...]

















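[Figure: chart of the miss rate reduction per benchmark for the FPB, LRU, and STREAM models.]
Figure 7: Percent miss rate reduction for 256 bytes data cache size.
[Figure: chart of the miss rate reduction per benchmark for the FPB, LRU, and STREAM models.]
Figure 8: Percent miss rate reduction for 2 Kbytes data cache size.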
As shown in Figure 7, in general all prefetching schemes improve the miss rate, with
average miss rate reductions of 6%, 2.39%, and 6.14% for the FPB, LRU, and STREAM models,
respectively. Comparing these values with those for the 512 bytes cache, we see that
prefetching performs better with the 256 bytes cache, because the average miss rate
reduction for 512 bytes is lower than for 256 bytes. Figure 8 shows the miss rate reduction
when the cache size is 2 Kbytes. On average, our scheme performs best, with an average miss
rate reduction of 6.8%, while the LRU and STREAM models achieve average miss rate reductions
of 0.8% and 2.5%, respectively. However, to see the tradeoffs between power and performance
when the cache size is changed, the EDs for these two cache sizes were measured and are
shown in Figures 9 and 10.





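[Figure: chart of the energy-delay product per benchmark (cjpeg, epic, pegwitdecode, pegwitencode, mpeg2decode, mpeg2encode, Average) for the FPB, LRU, and STREAM models.]
Figure 9: Energy-Delay product for 256 bytes data cache size.
[Figure: chart of the energy-delay product per benchmark (cjpeg, epic, pegwitdecode, pegwitencode, mpeg2decode, mpeg2encode, Average) for the FPB, LRU, and STREAM models.]
Figure 10: Energy-Delay product for 2 Kbytes data cache size.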
Figures 9 and 10 show the ED for the 256 bytes and 2 Kbytes cache sizes. On average,
the ED of our scheme is the lowest when compared with the LRU and STREAM models. For
256 bytes, the ED of our scheme is 1.6, while the EDs of the LRU and STREAM models are 1.7
and 1.8, respectively. For 2 Kbytes, the ED of our scheme is 1.37, while the EDs of the LRU
and STREAM models are 1.38 and 1.43, respectively. Hence, the smallest ED is achieved by our
scheme when the cache size is 2 Kbytes. We now look at the tradeoff between the percentage
of pollution and the percentage of cache hits on prefetched blocks; the results are shown in
Tables 9 and 10. On average, the 256 bytes cache provides a larger percentage improvement
than the 2 Kbytes cache for our scheme (3.08% for 256 bytes versus 1.72% for 2 Kbytes).
However, for 256 bytes the STREAM model achieves a larger percentage improvement than our
scheme: it provides a 4.76% improvement. On the other hand, our scheme works better than the
STREAM model when the cache size is 2 Kbytes (our scheme provides a 1.72% improvement while
the STREAM model provides a 1.08% improvement).
Table 9: Percentage of pollution from prefetched data and hit on prefetched block for
256 bytes cache size.
FPB
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          3617719         3494468          124768            3.4069           3.4488         0.0419
epic           5533291         4900253          1160469           11.4405          20.9725        9.5320
pegwitdecode   3430182         3332071          52832             2.8602           1.5402         -1.3200
pegwitencode   5862241         5680449          96206             3.1011           1.6411         -1.4600
mpeg2decode    17527078        17389364         2130022           0.7857           12.1528        11.3670
mpeg2encode    286305543       282671777        4469032           1.2692           1.5609         0.2917
Average        53712675.6667   52911397.0000    1338888.1667      3.8106           6.8860         3.0754
LRU
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          3617719         3401747          224243            5.9698           6.1985         0.2286
epic           5533291         5534045          1566              -0.0136          0.0283         0.0419
pegwitdecode   3430182         3100215          333120            9.6195           9.7114         0.0919
pegwitencode   5862241         5344918          619823            8.8247           10.5731        1.7485
mpeg2decode    17527078        15154845         4001899           13.5347          22.8327        9.2980
mpeg2encode    286305543       281223722        6105401           1.7750           2.1325         0.3575
Average        53712675.6667   52293248.6667    1881008.6667      6.6183           8.5794         1.9611
STREAM16
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          3617719         3543162          107618            2.0609           2.9747         0.9139
epic           5533291         5145387          1055733           7.0104           19.0797        12.0693
pegwitdecode   3430182         3420337          10942             0.2870           0.3190         0.0320
pegwitencode   5862241         5927668          44375             -1.1161          0.7570         1.8730
mpeg2decode    17527078        18087116         1958950           -3.1953          11.1767        14.3720
mpeg2encode    286305543       284625187        3276182           0.5869           1.1443         0.5574
Average        53712675.6667   53458142.8333    1075633.3333      0.9390           5.9086         4.9696
Table 10: Percentage of pollution from prefetched data and hit on prefetched block for
2 Kbytes cache size.
FPB
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          4346636         4145215          219437            4.6340           5.0484         0.4145
epic           5922291         5242953          1292925           11.4709          21.8315        10.3606
pegwitdecode   4188852         4096016          68616             2.2163           1.6381         -0.5782
pegwitencode   6904558         6766273          131839            2.0028           1.9094         -0.0934
mpeg2decode    31608239        27266173         4417585           13.7371          13.9761        0.2389
mpeg2encode    322836835       317120299        5645746           1.7707           1.7488         -0.0219
Average        62634568.5000   60772821.5000    1962691.3333      5.9720           7.6920         1.7201
LRU
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          4346636         4084907          277833            6.0214           6.3919         0.3705
epic           5922291         5920864          1820              0.0241           0.0307         0.0066
pegwitdecode   4188852         3891032          300314            7.1098           7.1694         0.0595
pegwitencode   6904558         6275508          544183            9.1106           7.8815         -1.2291
mpeg2decode    31608239        26642574         5016850           15.7100          15.8720        0.1619
mpeg2encode    322836835       316051795        6666882           2.1017           2.0651         -0.0366
Average        62634568.5000   60477780.0000    2134647.0000      6.6796           6.5684         -0.1112
STREAM16
Benchmarks     Cache hit w/o   Hit on regular   Hit on prefetch   Pollution from   Hit on         Improvement
               prefetching     block            block             prefetch (%)     prefetch (%)   (%)
cjpeg          4346636         4279171          72103             1.5521           1.6588         0.1067
epic           5922291         5497353          1077705           7.1752           18.1974        11.0222
pegwitdecode   4188852         3924810          5656              6.3034           0.1350         -6.1684
pegwitencode   6904558         6977649          8930              -1.0586          0.1293         1.1879
mpeg2decode    31608239        30523126         1175148           3.4330           3.7179         0.2848
mpeg2encode    322836835       320526762        2534259           0.7156           0.7850         0.0694
Average        62634568.5000   61954811.8333    812300.1667       3.0201           4.1039         1.0838
5 CONCLUSIONS 
In this thesis, we studied the tradeoffs between power and performance when using
cache prefetching. In addition, we introduced a new technique called "FPB" to reduce cache
pollution due to prefetching in a small cache. Our scheme reduces cache pollution by
allowing only one block in each cache set to hold prefetched data and by applying a
different replacement priority to prefetched data. The simulation results show that our
scheme with a 2 Kbytes cache provides a miss rate reduction of 6.8% on average, while
consuming 37% more power than the reference machine. Moreover, for this particular cache
size, our scheme performs better than the STREAM model while consuming less power.
However, in some cases prefetching schemes consume more power than the reference machine
without improving the miss rate at all, because the prefetching technique does not capture
the memory access patterns accurately. Note that in our simulations the EDs of all
prefetching schemes were larger than that of the reference machine model, because the miss
rate reduction achieved by prefetching could not compensate for the extra power of the
prefetching hardware. Thus, for prefetching to be useful, the extra hardware needs to be
weighed against the improvement in performance. Otherwise, not only will performance be
degraded but power consumption will also increase. To make prefetching perform better in
terms of power and performance tradeoffs, the cache configuration and the prefetching scheme
should be selected properly.
In the future, to make prefetching more useful and create less pollution while
requiring less additional power, prefetching accuracy can be improved by turning prefetching
on or off. A decision scheme for turning prefetching on or off would be worthwhile future
work for our scheme. Moreover, as prefetching brings data from main memory before the
processor needs it, careful scheduling of several "lumped" memory references from
prefetching and cache misses may be useful.
REFERENCES 
[1] T. M. Austin, D. Burger, "The SimpleScalar Tool set, Version 2.0," Computer 
Science Dept. Technical Report, No. 1342, Univ. of Wisconsin, 1997. 
[2] R. Bechade and et. al., "A 32b 66MHz 1.8W microprocessor," Proc. of International 
Solid-State Circuits Conference, pages 208-209, 1994. 
[3] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-
Level Power Analysis and Optimizations" Proc. of International Symposium on 
Computer Architecture, pages 83-94, 2000. 
[4] T. F. Chen and J. L. Baer, " A Performance study of software and hardware data 
prefetching schemes," Proc. of International Symposium on Computer Architecture, 
pages 223-232, 1994. 
[5] R. Chen, M. Irwin, and R. Bajwa, " An architectural level power estimator, " In 
Power-Driven Microarchitecture Workshop at ISCA25, 1998. 
[6] Chi-Hung Chi and S. L. Lan, "Data Prefetching With Co-Operative Caching," In 5th 
International Conference on High Performance Computing, HIPC'98, pages 25-32, 
1998. 
[7] M. J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design," 
Massachusetts: Jones and Bartlett, 1996. 
[8] J. Fu, J. Patel, and B. Janssens, "Stride directed prefetching in scalar processor," Proc. 
of the 25th International Symposium on Microarchitecture, pages 102-110, 1992. 
[9] J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative 
Approach," New York: Morgan-Kaufmann, 1993. 
39 
[10] P. Hicks, M. Walnock, and R. M. Owens, "Analysis of power Consumption in 
Memory Hierarchies," Proc. of the International Symposium on Low Power 
Electronics and Design", page 239-242, 1997. 
[11] N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small 
fully-associative cache and prefetch buffers," Proc. of 1 ih Annual International 
Symposium on Computer Architecture, pages 364-373, 1990. 
[12] M. B. Kamble and K. Ghose, "Analytical energy dissipation models for low power 
caches," Proc. of International Symposium on Low Power Electronics and Design, 
pages 143-148, 1997. 
[13] J. Kin, G. M. Gupta, and W. H. Mangione-Smith, "The filter cache: an energy 
efficient memory structure," Proc. of 30th Annual International Symposium on 
Microarchi tecture, pages 184-193, 1997. 
[14] A. C. Klaiber and H. M. Levy, "An architecture for software-controlled data 
prefetching," Proc. of 18th Annual International Symposium on Computer 
Architecture, pages 43-53, 1991. 
[15] C. Lee, M. Potkonjak, and W. H. Mangione Smith, "MediaBench: A Tool for 
evaluating Multimedia and Communications Systems," Proc. of Micro 30, pages 330-
335, 1997. 
[16] Yue Lin and D. R. Kaeli, "Branch-Directed and Stride-Based Data Prefetching," 
Proc. of International Conference on Computer Design: VLSI in Computer and 
Processor, ICCD'96, pages 225-230, 1966. 
[17] J. Montanaro and e. al., "A 160MHz 32b 0.35W CMOS RISC Microprocessor," Proc. 
of International Solid-State Circuits Conference, pages 1703-1714, 1996. 
40 
[18] T. Mowry, M. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm 
for prefetching," SIGPLAN Notices, pages 62-73, 1992. 
[19] S. Palacharla and R. Kessler, "Evaluating stream buffers as a secondary cache 
replacement," Proc. of 21 st Annual International Symposium on Computer 
Architecture, pages 24-33, 1994. 
[20] S. Palacharla, N. Jouppi, and J. Smith, "Complexity-Effective Superscalar 
Processors," Proc of 24th International Symposium on Computer Architecture, pages 
206-218, 1997. 
[21] A. K. Porterfield, "Software methods for improvement of cache performance on 
supercomputer applications," Rice Univ., Houston, TX, Tech. Rep. COMP TR 89-93, 
1989. 
[22] V. Santhanam, E. H. Gomish, and W. C. Hsu, "Data prefetching on the HP PA-
8000," Proc. of 24th Annual International Symposium on Computer Architecture, 
pages 264-273, 1997. 
[23] A. J. Smith, "Cache memories," ACM Computing Surveys, 14:473-530, 1982. 
[24] C. Su and A. Despain, "Cache design trade-offs for power and performance 
optimization: a case study," Proc. of International Symposium on Low Power 
Electronics and Design, pages 63-68, 1995. 
[25] N. Weste and K. Eshraghian, "Principle of CMOS VLSI Design," New York: 
Addison-Wesley, 1993. 
[26] S. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-chip 
Caches," In WRL Research Report 93/5, DEC Western Research Laboratory, 1994. 
41 
[27] D. F. Zucker, M. J. Flynn, and R. Lee, "A comparison of hardware prefetching 




I would like to thank my major professor Dr. Gyungho Lee for his constant guidance, 
direction and assistance. 
I would like to thank Dr. Manimaran Govindarasu, and Dr. Shashi Gadia for being on 
my committee. 
I would like to thank my parents, my wife, and my friends from the Geographic
Information Systems facility lab for helping me directly or indirectly with my graduate
program and for all their encouragement.
