Feasibility analysis of correlation based prefetching using digital signal processing by Morse, David L.
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
2004
Recommended Citation
Morse, David L., "Feasibility analysis of correlation based prefetching using digital signal processing" (2004). Thesis. Rochester
Institute of Technology. Accessed from
FEASIBILITY ANALYSIS OF CORRELATION 
BASED PREFETCHING USING DIGITAL SIGNAL 
PROCESSING 
By 
David L. Morse 
August 2004 
A thesis submitted in partial fulfillment of the 
requirements for the degree of 
Master of Science in Computer Engineering 
Rochester Institute of Technology 
Approved by: 
Greg P. Semeraro 
Primary Advisor: Dr. Greg Philip Semeraro, Assistant Professor 
Juan Cockburn 
Advisor: Dr. Juan Carlos Cockburn, Associate Professor 
M.Shaaban 
Advisor: Dr. Muhammad Shaaban, Assistant Professor 
THESIS RELEASE PERMISSION FORM 
Rochester Institute of Technology 
Kate Gleason College of Engineering 
Title: Feasibility Analysis of Correlation Based Prefetching using Digital Signal
Processing
I, David Morse, hereby grant permission to the Wallace Memorial Library to
reproduce my thesis in whole or in part.
David Morse 
Date 
Dedication
This work is dedicated to my mother who didn't make it long enough to see me
graduate from college.
Acknowledgements
The author would like to thank Dr. Greg Semeraro, Dr. Juan Cockburn, and Dr.
Muhammad Shaaban for participating in the author's thesis committee. The author
would also like to thank the Rochester Institute of Technology and in particular the
Computer Engineering Department of the Kate Gleason College of Engineering and
its entire faculty for the exemplary job it has done in preparing the author for the
research work required by this thesis.
ABSTRACT
FEASIBILITY ANALYSIS OF
CORRELATION BASED
PREFETCHING USING DIGITAL
SIGNAL PROCESSING
by David Morse
Supervising Professor: Dr. Greg P. Semeraro
Department of Computer Engineering
As the gap between processor performance and memory performance continues to
broaden with time, techniques to hide memory latency such as correlation based
prefetching become exceedingly important. When a memory reference issued by the
processor misses the level one cache, the request propagates down the memory
hierarchy until it finally finds the requested datum. With each layer traversed, the
latency grows exponentially. Prefetching is a technique used to hide this latency by
attempting to predict which memory references will be requested in the near future,
and then load them into cache before they are needed.
This work investigates the use of digital signal processing techniques in designing an
effective prefetch algorithm. The algorithm proposed in this work uses the Kalman
Filter as the basic digital signal-processing block. The sequence of memory address
references with respect to time is interpreted as a digital signal. By applying Kalman
filtering techniques, a robust prediction algorithm is presented to predict future miss
references based on the pattern of previous miss references.
The algorithm was simulated using 40 benchmark programs from the Olden,
MediaBench, and SPEC benchmark suites for the Alpha 21264 and the PISA (a MIPS-
like ISA) instruction set architectures. A main difference between these two ISAs is
that the Alpha 21264 ISA contains software prefetch instructions, and the PISA
instruction set architecture does not. The simulations place a prefetcher unit between
the level one data cache and the level two unified cache. SimpleScalar simulation
results for a broad set of benchmark programs using 32 Kalman filter blocks show an
average of 6.5% speedup for the Alpha 21264 ISA, and an average of 5.6% speedup
for the PISA instruction set architecture for those benchmark programs which have a
potential speedup from prefetching greater than 10%.
TABLE OF CONTENTS
Dedication
Acknowledgements
ABSTRACT
GLOSSARY
Chapter 1: Introduction
Chapter 2: Motivation
Chapter 3: Prefetching and the Kalman Filter
    Compiler based prefetching
    Stream Buffers
    Stride Prefetchers
    Correlation based prefetching: the Markov Predictor
    The Kalman Filter
Chapter 4: Algorithm Design
Chapter 5: Algorithm Evaluation
Chapter 6: Simulation
    Evaluation metrics
    Integration with SimpleScalar
    Simulated system
    Results
    Compiler Optimization Levels
Chapter 7: Discussion
    Timeliness
    Feasibility and practicality
Chapter 8: Conclusions
    Future Directions
Bibliography
LIST OF FIGURES AND TABLES
Figure 1.1: Memory hierarchy for architecture simulated
Figure 2.1: Excerpt from data level one cache miss reference stream for equake (Alpha)
Figure 3.1: The stream buffer (taken from [22])
Figure 3.2: Stride prefetch algorithm hardware (taken from [35])
Figure 3.3: Hardware implementation of the Markov Predictor (taken from [4])
Figure 4.1: General algorithm layout using n internal filters
Figure 4.2: Flowchart of control unit's logic when receiving a miss reference
Figure 4.3: Flowchart of control unit's logic when receiving a cache hit
Figure 4.4: Simplified implementation diagram of control logic and prefetch logic
Table 5.1: Benchmarks used
Table 5.2: Algorithm evaluation parameters
Figure 5.3: Percentage of correct prefetches out of prefetches issued for Alpha
Figure 5.4: Percentage of memory access correctly prefetched for Alpha
Table 5.5: Accuracy of algorithm for Alpha ISA
Table 5.6: Effectiveness of algorithm for Alpha ISA
Figure 5.7: Accuracy of algorithm for PISA
Figure 5.8: Effectiveness of algorithm for PISA
Table 5.9: Accuracy of algorithm for PISA
Table 5.10: Effectiveness of algorithm for PISA
Table 5.11: Effectiveness of algorithm for Alpha with large NKF values
Table 5.12: Effectiveness of algorithm for PISA with large NKF values
Figure 5.13: Effectiveness of algorithm vs. NKF
Figure 6.1: Flowchart of SimpleScalar logic to implement the algorithm
Table 6.2: Architecture configuration used for the simulations
Figure 6.3: Speedup on Alpha benchmarks with potential speedup greater than 10%
Table 6.4: Average speedups for Alpha benchmarks with potential speedup greater than 10%
Figure 6.5: Speedup on Alpha benchmarks with potential speedup less than 10%
Figure 6.6: Speedup on PISA benchmarks with potential speedup greater than 10%
Table 6.7: Average speedups for PISA benchmarks with potential speedup greater than 10%
Figure 6.8: Speedup on PISA benchmarks with potential speedup less than 10%
Figure 6.9: Average speedup for various optimization levels of Alpha
Figure 6.10: Speedup for various optimization levels of PISA
GLOSSARY
Access Pattern: The sequence of memory accesses issued by a program
GUI: Graphical User Interface
Memory Reference Stream: Sequence of memory addresses referenced by an execution of a program
Miss Reference Stream: Sequence of cache misses that are caused by an execution of a program
NKF: Number of Kalman Filters - the number of filters used in the algorithm. Dictates the implementation complexity of the algorithm.
RDM: Recycling Delta Multiplier - the factor that multiplies the number of cycles between consecutive input points for a particular filter. Used in conjunction with the RT to recycle filters back to the available pool.
RT: Recycling Threshold - the number of cycles in which a filter will be recycled if it does not receive another input point.
Sim-outorder: A product of the SimpleScalar CPU Simulation suite that simulates an out-of-order issue superscalar microprocessor
SRAM: Static Random Access Memory
VSTT: Virtual Stream Thresholding Tolerance - the threshold in bytes used to determine if a miss reference belongs to a particular filter
Chapter 1
Introduction
The problem of cache miss latency continues to grow as processor speeds increase
faster than their memory system counterparts. During a given execution of a
program on a microprocessor, according to the instructions in the program, the
processor will need to access memory in various locations to perform its task.
Since processor performance is increasing at a faster rate than memory
performance [31], the penalty incurred for accessing memory continues to increase.
To combat this problem, the cache was designed to serve as an intermediate step in
the memory hierarchy that holds the memory values currently in use in a high
speed SRAM (Static Random Access Memory). The concepts of temporal locality
and spatial locality underlie traditional cache design: memory addresses that have
recently been accessed have a higher probability of being accessed again soon, and
memory addresses near a recently accessed address also have a higher probability
of being accessed.
A cache works by bringing blocks of memory into the cache as the processor requests
them. Once the block is in the cache, the processor can access and modify these
memory values as though it were directly modifying the values in the main memory.
Eventually the cache will become full and existing cache blocks have to be removed
from the cache to make room for new cache blocks. At this time the contents of the
cache block are written back to main memory and a new cache block is loaded into the
previous cache block's space. There are several variations of cache designs and policies
that attempt to maximize performance, but the algorithm presented in this work does
not require any specific cache design.
The algorithm used in this work is simulated on an architecture that uses three caches,
an instruction level one cache, a data level one cache, and a unified level two cache.
They are organized in the memory hierarchy as described in figure 1.1 [31].
[Figure: block diagram. The processor connects to the data L1 cache and the instruction L1 cache; both L1 caches connect to the unified L2 cache, which connects to main memory.]
Figure 1.1: Memory hierarchy for architecture simulated
As the processor executes a program, it will encounter instructions that reference
various addresses in memory. These are the load and store instructions, which read
and write values from and to memory. These memory instructions are given memory
addresses as their arguments. This memory address is then transferred to the data level
one cache, which searches through its contents to see if it has the particular value in its
cache. If it does, it quickly returns the value contained in that address. If not, it will
send an inquiry to the unified level two cache to see if it has the value. Since the level
two cache is much larger than the level one caches, it may have this value even when
the level one cache does not. If it does, it returns the cache block to the level one
cache, which inserts the block into its own storage and returns the requested value to
the processor. If it does not, it in turn queries the main memory (or the next level of
cache if there are more levels) for the address. For each level of the memory hierarchy that has
to be traversed to find a memory value, the latency increases, and the processor has to
wait a longer amount of time for the value it needs to continue processing. Prefetching
is a process in which these memory references are predicted, and loaded into cache before
the processor asks for the values. The goal is to have a net decrease in execution time
of a program. The sequence of these memory addresses is the memory reference stream,
and the sequence of these references that are cache misses (the cache does not have
the value), is referred to as the miss reference stream.
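
The relationship between the memory reference stream and the miss reference stream can be illustrated with a minimal Python sketch. It assumes a toy direct-mapped level one cache that tracks only block tags; the class name `DirectMappedCache`, the function `miss_reference_stream`, and the cache geometry are hypothetical and are not taken from the simulated architecture.

```python
from typing import List, Optional


class DirectMappedCache:
    """Toy direct-mapped cache that tracks block tags only (no data)."""

    def __init__(self, num_lines: int, block_size: int) -> None:
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags: List[Optional[int]] = [None] * num_lines

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, install the block and return False."""
        block = address // self.block_size   # block number containing the address
        index = block % self.num_lines       # cache line the block maps to
        if self.tags[index] == block:
            return True
        self.tags[index] = block             # evict whatever was there
        return False


def miss_reference_stream(addresses: List[int], l1: DirectMappedCache) -> List[int]:
    """Filter a memory reference stream down to the references that miss in l1."""
    return [a for a in addresses if not l1.access(a)]


if __name__ == "__main__":
    l1 = DirectMappedCache(num_lines=4, block_size=32)
    references = [0, 8, 32, 40, 64, 0, 128, 0]
    print(miss_reference_stream(references, l1))
```

In this sketch the second access to address 0 hits, but the access to address 128 maps to the same line and evicts that block, so the final access to address 0 misses again and reappears in the miss reference stream.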
The algorithm presented in this work focuses on prefetching for the data level one
cache, though it could be applied to the instruction level one cache and the unified
level two cache as well. It is based on the presence of access patterns in the memory
reference stream that can be recognized by the prefetcher hardware and accurately
predicted. This algorithm treats the memory reference stream as a digital signal, which is
simply a series of discrete points, and applies traditional digital signal processing
techniques as the basis of the prefetcher prediction logic. The design of the algorithm
presented in this work is configured to recognize and lock on to strided access patterns
in the data. The result is a prefetcher that is shown in simulation using SimpleScalar
[15] to actually decrease execution time for benchmarks from the Olden [32],
MediaBench [33], and SPEC2000 [11] benchmark suites. The evaluations show an
average reduction in miss references of 29.4% for the Alpha ISA benchmark
programs, and 40.9% for the PISA architecture benchmark programs. The
SimpleScalar simulations of this algorithm show an average speedup of 6.5% for the
Alpha ISA benchmark programs and 5.6% for the PISA architecture benchmark
programs, in both cases averaged over those benchmarks whose potential speedup
from prefetching is greater than 10%.
The remainder of this document is organized as follows: Chapter 2 describes the
motivation for this work, in terms of existing prefetch designs and how this algorithm
can improve on them. In Chapter 3, detailed explanations of some currently proposed
prefetch algorithms are discussed, along with a theoretical background of the Kalman
Filter algorithm. In Chapter 4, the design of the prefetch algorithm proposed by this
work is presented. In Chapter 5, a theoretical evaluation of the effectiveness of this
algorithm is conducted. In Chapter 6, the code modifications to SimpleScalar [15], and
the results of the SimpleScalar simulations are presented. In Chapter 7, a discussion of
the algorithm and a comparison of the theoretical results with the simulation results
are given. Chapter 8 presents concluding remarks.
Chapter 2
Motivation
There are many classes of applications that suffer from the latency of memory
operations, and as processor clock speeds increase, this disparity grows larger [1, 2, 14].
One major technique to hide this memory latency is prefetching. Prefetching schemes
have been proposed using both software [19] and hardware [1, 2, 3, 4, 5, 10, 16]
techniques. Prior research has shown that many schemes reduce the number of cache
misses for various programs, but no prefetcher is ideal in all situations. In addition,
to be effective a prefetching algorithm must also be shown to actually reduce
execution time [12].
The most basic prefetch scheme is to simply increase the cache line size [20]. This
relies on the concept of spatial locality, and automatically prefetches data at memory
addresses that are near the original miss address. This idea is extended in [21] with
Sequential Prefetching. Sequential Prefetching does not just fetch one cache line, but
fetches the next few cache lines as well, in hopes they will be accessed soon. A further
improvement on this scheme is presented in [22], the stream buffer. The stream buffer
stores prefetched data in a separate hardware unit from the cache, and as data is
consumed from the stream buffer, it continues to prefetch sequential cache lines.
These stream buffers are fetched using a unit stride, fetching each cache line one after
another, but this idea was later extended with the introduction of the stride prefetcher
[23, 35], which calculated a non-unit stride between memory accesses by a given load
instruction.
Prefetchers that are based on spatial locality such as the stream buffer and stride
prefetcher provide a reduction in cache misses for applications that access data in
regular patterns, such as many scientific applications. However, for applications that
access data in non-regular patterns, such as applications that use linked data structures,
these prefetchers do not achieve the same success as they do in other applications
[1, 4]. For this reason, correlation based prefetchers have been proposed [1, 2, 4, 5, 24].
Correlation based prefetchers work by attempting to exploit any correlation between
previous memory accesses and other execution parameters with future memory
accesses. If these future memory accesses can be predicted with some amount of
accuracy, then a prefetcher can be employed to bring these memory addresses into
cache early and hide memory latency.
Typical correlation based prefetchers employ the use of a correlation table [4] in which
memory accesses are stored in a table along with future memory accesses. Joseph and
Grunwald [4] propose the use of a Markov model to generate the probabilities that a
given memory access will follow some other memory access. These patterns are stored
in the correlation table structure along with their transition probabilities. When there is
a miss reference, the table is consulted to see what the next miss references are likely to
be, and prefetch requests for them are placed into a prefetch request queue. While the
program is running, the transition probabilities are constantly updated to determine the
most likely patterns of miss references. The downside of this model is that it can take a
long time for the system to build a suitable table of transition probabilities for a given
program, and it requires a large amount of hardware to store the correlation table.
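
The correlation table idea described above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: transition probabilities are estimated from simple transition counts, the table is unbounded, and the names `CorrelationTable`, `on_miss`, and `max_predictions` are hypothetical rather than taken from [4].

```python
from collections import Counter, defaultdict


class CorrelationTable:
    """Sketch of a Markov-style correlation table keyed by miss address."""

    def __init__(self, max_predictions=2):
        # successors[a][b] counts how often miss b directly followed miss a.
        self.successors = defaultdict(Counter)
        self.max_predictions = max_predictions
        self.last_miss = None

    def on_miss(self, address):
        """Record the transition last_miss -> address and return prefetch candidates."""
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address
        counts = self.successors.get(address)
        if not counts:
            return []                      # nothing learned yet for this address
        return [a for a, _ in counts.most_common(self.max_predictions)]

    def probability(self, prev, nxt):
        """Estimated probability that miss nxt follows miss prev."""
        counts = self.successors.get(prev)
        total = sum(counts.values()) if counts else 0
        return counts[nxt] / total if total else 0.0


if __name__ == "__main__":
    table = CorrelationTable()
    for miss in [100, 200, 100, 300, 100, 200, 100]:
        candidates = table.on_miss(miss)
    print(candidates, table.probability(100, 200))
```

The sketch also makes the stated drawback visible: no candidates are returned for an address until it has been seen at least once before, and the table grows with the number of distinct miss addresses.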
This work proposes a new approach to correlation based prefetching that does not
require large tables of values to be stored in memory, and has the potential of not
needing long periods of time and many repeated memory access patterns to be able to
accurately issue prefetches. This work evaluates the feasibility of using digital signal
processing to implement a flexible prefetch algorithm.
Digital signal processing has previously been proposed for use in microarchitecture for
branch prediction [17]. The FAB Predictor uses Fourier analysis on branch outcomes
to predict future branches and achieve processor speedup. The main benefit that the
FAB predictor has over other branch prediction schemes is that it can represent
patterns of branch outcomes that are very long, and would normally take an unrealistic
amount of memory to store. By transforming branch history patterns from the time
domain to the frequency domain, it is possible to store patterns that would take 2 bits
in the time domain in only 52 bits in the frequency domain. This allows detection of
patterns that have a very long period. Digital signal processing is also used in other
areas of microarchitecture. In a Multiple Clock Domain (MCD) processor [18], the
clock signal is divided across the processor die into multiple domains. To control the
frequencies of each domain, an Attack-Decay algorithm based on digital control
techniques [25] is used, with the objective being to save energy by reducing the clock
frequency in domains that are not fully active for given portions of time.
The miss reference stream of a program execution is defined as the sequence of level-
one cache miss addresses. In this work, the term miss reference stream refers to the
level one data cache miss reference stream, unless otherwise noted. The miss reference
stream of a program execution can be treated as a digital signal with respect to time.
Each miss address is a discrete signal point. As mentioned earlier, correlation based
prefetchers attempt to extrapolate correlations between past miss references to predict
future references, but many of the proposed prefetchers accomplish this by creating
and maintaining large history tables which are examined to see if a prefetch could be
made. The algorithm proposed in this work investigates the use of the Kalman Filter
[6, 7, 8, 9] to analyze the miss reference stream and use the prediction capabilities of
the Kalman Filter to determine prefetch addresses. The Kalman Filter was chosen
because of its flexibility, ease of implementation, and ability to be pipelined [7, 26].
Because of this and the algorithm's tolerance for latency, an implementation of this
algorithm would not impact processor cycle time. The miss reference stream serves as
the input signal for the Kalman Filter. Its predictions are stored, and when a previously
predicted value matches the next miss reference, a prefetch is issued for the next value
predicted by the Kalman Filter. This way, if the miss reference stream is behaving in a
predictable manner according to the state model assumptions of the Kalman filter,
then predictions can continue to be made, and prefetches can continue to be issued.
Since the Kalman Filter only needs to store its internal state parameters, large tables of
values do not need to be stored to implement the prefetcher. The main limitation of
this prefetch algorithm is the computational complexity of implementing the Kalman
Filter algorithm, but the International Technology Roadmap for Semiconductors
(ITRS) [30] predicts that the trend of increasing transistor counts on microprocessors
will continue, enabling these filters to be implemented. An argument can also be made
to simply increase sizes of look-up tables to enhance other existing prefetch
algorithms, but as memory sizes increase, the speeds tend to decrease. The algorithm
proposed in this work does not have that limitation.
The algorithm proposed here consists of a variable number of Kalman Filters. Each
Kalman Filter is responsible for analyzing its own virtual miss reference stream. The overall
miss reference stream is a conglomerate of all the miss references issued by the
program, and can be thought of as the superposition of multiple individual miss
reference streams, which when divided can be analyzed by a Kalman filter to produce
correct predictions. Each virtual miss reference stream is one of these individual
reference streams. To determine which miss reference belongs to which
stream, simple thresholding with respect to miss address is used. If the miss address is
within the threshold value of the previous miss address that was routed to a given
Kalman Filter, then the current miss address is routed to that filter. If none of the
filters had a previous value that is within the threshold, then a new filter is initialized
with the current point. If all filters are currently in use, the point is ignored.
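
The thresholding rule described above can be sketched in Python as follows. This is a minimal illustration, not the control logic of the proposed design: the class name `StreamRouter` is hypothetical, the comparison is assumed to be inclusive, the first matching filter is assumed to win, and the recycling of idle filters back to the available pool is omitted.

```python
class StreamRouter:
    """Sketch: assign each miss address to a virtual miss reference stream."""

    def __init__(self, num_filters, threshold_bytes):
        self.threshold = threshold_bytes
        # Last miss address routed to each filter; None marks an unused filter.
        self.last_address = [None] * num_filters

    def route(self, miss_address):
        """Return the index of the filter receiving this miss, or None if it is ignored."""
        # First choice: a filter whose previous miss lies within the threshold.
        for i, last in enumerate(self.last_address):
            if last is not None and abs(miss_address - last) <= self.threshold:
                self.last_address[i] = miss_address
                return i
        # Otherwise start a new virtual stream on an unused filter, if one exists.
        for i, last in enumerate(self.last_address):
            if last is None:
                self.last_address[i] = miss_address
                return i
        return None  # all filters are in use, so the point is ignored


if __name__ == "__main__":
    router = StreamRouter(num_filters=2, threshold_bytes=256)
    for miss in [1000, 50000, 1064, 50128, 900000, 1128]:
        print(miss, router.route(miss))
```

With two filters and a 256 byte threshold, the interleaved misses near 1000 and near 50000 are separated into two virtual streams, while the unrelated miss at 900000 is dropped because no filter is free.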
A main benefit of this algorithm and its method of using the Kalman filter rather than
storing large tables of values is in the context-switching environment of an actual
operating system. When a context switch occurs, typical prefetch algorithms need to
flush their predictor tables and begin learning again if they are made aware of the
context switch, or will just begin running the unrelated code with inaccurate data
structures guiding their prediction algorithms. The algorithm proposed in this work
can typically begin to prefetch accurately within four cache miss references of a given
virtual stream, so a context switch does not have as much of a penalty as it does on
other prefetch algorithms.
Figure 2.1 shows an example of the miss reference stream for equake (Alpha 21264
ISA). There are several virtual streams visibly present, each of which has a strided
access pattern.
[Figure: plot titled "Miss Reference Stream for Equake [Data Level 1 cache]". Horizontal axis: cycle number, from 4.8 to 5.4 (x10^7). Vertical axis ticks from 1.705 to 1.73 (x10^7).]
Figure 2.1: Excerpt from data level one cache miss reference stream for equake (Alpha)
The algorithm proposed in this work is functionally similar to the DSTRIDE predictor
[10]. The DSTRIDE predictor uses the miss reference stream and maintains a Stride
Prediction Table of previous miss addresses. Comparisons are made between the past
few miss addresses and the current miss address to determine all possible stride values.
The algorithm also uses a Steady Stride Table with state bits to determine when a
steady stride exists, and issues prefetches based on that. The algorithm proposed here
is similar in that it uses a particle-with-constant-velocity state model for the
Kalman Filters. These Kalman Filters are designed to be most effective in detecting
and predicting stride patterns. However, the proposed algorithm does not need to
store tables of prediction values. Each Kalman Filter used
maintains the history of its miss reference stream via its recursive nature.
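
To make the constant-velocity idea concrete, the following Python sketch implements a textbook scalar-measurement Kalman filter whose state is an address and a stride. It is an illustration under stated assumptions, not the filter design of this work: one filter step is taken per miss rather than per cycle, and the class name `ConstantVelocityKalman` and the noise parameters `q`, `r`, and `initial_stride_var` are hypothetical values chosen for the example.

```python
class ConstantVelocityKalman:
    """Kalman filter with state [address, stride] and a scalar address measurement."""

    def __init__(self, first_address, q=0.01, r=1.0, initial_stride_var=1e6):
        self.p = float(first_address)   # estimated address
        self.v = 0.0                    # estimated stride (bytes per miss)
        # Covariance: address known to measurement accuracy, stride unknown.
        self.P = [[r, 0.0], [0.0, initial_stride_var]]
        self.q = q                      # process noise variance (assumed)
        self.r = r                      # measurement noise variance (assumed)

    def predict_next(self):
        """Predicted address of the next miss in this stream."""
        return self.p + self.v

    def update(self, z):
        """Fold the observed miss address z into the state estimate."""
        (P00, P01), (P10, P11) = self.P
        # Time update: x' = F x and P' = F P F^T + Q, with F = [[1, 1], [0, 1]].
        p_pred = self.p + self.v
        v_pred = self.v
        A00 = P00 + P01 + P10 + P11 + self.q
        A01 = P01 + P11
        A10 = P10 + P11
        A11 = P11 + self.q
        # Measurement update with H = [1, 0].
        residual = z - p_pred
        S = A00 + self.r
        K0, K1 = A00 / S, A10 / S
        self.p = p_pred + K0 * residual
        self.v = v_pred + K1 * residual
        self.P = [[(1 - K0) * A00, (1 - K0) * A01],
                  [A10 - K1 * A00, A11 - K1 * A01]]


if __name__ == "__main__":
    kf = ConstantVelocityKalman(first_address=1000)
    for i in range(1, 8):
        kf.update(1000 + 64 * i)        # misses with a constant 64 byte stride
    print(kf.v, kf.predict_next())
```

Only the two state values and the 2x2 covariance are stored per filter, which is the sense in which the history of the stream is carried recursively rather than in a table.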
Typical stride prefetchers also require knowledge of the program counter, or the
nature of the current instruction being executed [23]. The algorithm proposed in this
work does not require this information. It only needs access to the miss references
issued by the data level-1 cache, and a data port on the data level-1 cache with which
to insert the prefetches into the data stream. Due to this design, a physical
implementation of this prefetching algorithm can execute in parallel with the processor
such that it would not affect the manner in which current processors are designed. In
fact, it would be quite simple to augment many existing processor designs to take
advantage of this algorithm.
The objective of this work is to present an algorithm which uses digital signal
processing techniques, specifically the Kalman Filter, to perform prefetching. The
algorithm will then be evaluated in a theoretical manner using pre-generated miss
reference streams to determine its effectiveness in being able to predict future miss
references. It will then be integrated into a cycle-by-cycle super-scalar processor
simulator to determine its effect on execution time for a set of benchmark programs.
The experiments conducted show that it is in fact feasible to use these digital signal
processing techniques in designing an effective prefetch algorithm. The contribution of
this work to microarchitectural research consists of an algorithm that is shown to be
successful in reducing execution times of a number of benchmark programs.
Chapter 3
Prefetching and the Kalman Filter
There have been many prefetch algorithms previously proposed, and many of the
algorithms have been shown to be effective in reducing the number of cache misses a
program execution experiences for certain conditions. Both software [19] and
hardware [1, 2, 3, 4, 5, 10, 16] approaches have been proposed. Most prefetch
algorithms are designed by first identifying predictable patterns in the memory accesses
in a program, and then trying to exploit those patterns. This section describes a few
of the most well-known prefetch algorithms.
Compiler based prefetching
One technique for prefetching is done at compile time. The compiler analyzes the
workload and determines if it is appropriate to insert special PREFETCH instructions,
which will load memory into the cache early, in an attempt to have it ready in cache
when the processor needs it. Mowry et al. [19] propose an algorithm for embedding
compile-time prefetches in compiled code for scientific applications, especially those
that deal with dense matrices. One of the main issues with software prefetching is that
nothing is known at compile time about the dynamics of an execution, so the prefetch
algorithm cannot adapt at runtime. Unnecessary prefetches are also an issue in this
algorithm, because the data may already be in cache at the time the prefetch is issued.
Since the compiled code does not know what is in the cache at the time it executes the
prefetch, these prefetch instructions may consume processor cycles without adding any
benefit. The simulation of this algorithm in [19] showed that for the selected
benchmarks, 50% to 90% of the processor cycles spent stalled waiting for
memory accesses were eliminated, at a cost of less than a 15% average increase
in the instruction counts for these programs.
The existence of compiler based prefetching is important to this work because the
Alpha 21264 architecture [34] does contain special instructions for prefetching, which
will load cache blocks into the data level one cache if they are not already present. Of
course, if the blocks are already in cache, then this is a wasted instruction. The PISA
architecture has no analogous instruction in its ISA.
Stream Buffers
Jouppi [22] proposed the use of a stream buffer, which is a FIFO queue full of cache
lines that have been prefetched due to a cache miss reference. This technique employs
a secondary hardware unit that is checked in parallel with the level one cache. When
there is a miss reference, the stream buffer will prefetch the next few cache lines and
store them in this queue, in hopes that the processor will need these cache lines next.
When the cache is checked for existence of a memory address, the head of the stream
buffer queue is also checked. If the requested cache line is in the stream buffer, then it
is moved into the level one cache and the stream buffer queue is moved forward.
Under good circumstances, this process will continue and the stream buffer will be
able to continually provide the next requested cache line.
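[Figure: block diagram. A direct mapped cache sits between the processor (from processor / to processor) and the next lower cache (to / from next lower cache). Beside it is the stream buffer (FIFO queue version), with its head entry and tail entry marked.]
Figure 3.1: The stream buffer (taken from [22])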
The stream buffer FIFO as shown in figure 3.1 is made up of a sequence of entries,
each of which has a cache line tag, an available bit, and a cache line of data. When a
line is prefetched, the tag is inserted into the queue with the available bit set to false.
When the cache line arrives, the available bit is set to true. In the most basic version of
the stream buffer, only the head of the FIFO is compared in parallel with the cache to
determine if the cache line is available, so even if the requested cache line is later on in
the queue, it will not be available. More advanced versions of the stream buffer allow
parallel concurrent searching of the entire queue for the requested cache line, at the
cost of additional hardware complexity.
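
The basic head-only stream buffer described above can be sketched in Python as follows. The sketch tracks tags and available bits but no data, and the names `StreamBuffer`, `allocate`, `fill`, and `lookup` are hypothetical; it is an illustration of the behavior described here rather than a reproduction of the hardware in [22].

```python
from collections import deque


class StreamBuffer:
    """Sketch of a FIFO stream buffer that compares only its head entry."""

    def __init__(self, depth, block_size):
        self.depth = depth
        self.block_size = block_size
        self.queue = deque()        # each entry is [tag, available]
        self.next_block = None      # next sequential block to request

    def allocate(self, miss_address):
        """On a cache miss, flush the queue and prefetch the following blocks."""
        self.queue.clear()
        self.next_block = miss_address // self.block_size + 1
        self._refill()

    def _refill(self):
        # Issue requests until the queue is full; data has not arrived yet.
        while len(self.queue) < self.depth:
            self.queue.append([self.next_block, False])
            self.next_block += 1

    def fill(self, tag):
        """Mark a requested line as arrived by setting its available bit."""
        for entry in self.queue:
            if entry[0] == tag:
                entry[1] = True

    def lookup(self, address):
        """Hit only if the head entry matches the block and its data has arrived."""
        tag = address // self.block_size
        if self.queue and self.queue[0][0] == tag and self.queue[0][1]:
            self.queue.popleft()    # line moves into the level one cache
            self._refill()          # keep prefetching sequential lines
            return True
        return False


if __name__ == "__main__":
    sb = StreamBuffer(depth=4, block_size=32)
    sb.allocate(0)
    sb.fill(1)
    sb.fill(2)
    print(sb.lookup(64), sb.lookup(32), sb.lookup(64))
```

The example shows the head-only limitation: the line for address 64 has already arrived, but the lookup fails until the line ahead of it in the queue has been consumed.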
The stream buffer works well for the instruction cache, but does not have the same
success with the data cache, since data accesses are typically interleaved from different
sources (the algorithm proposed in chapter 4 refers to these streams as virtual miss
reference streams). An expansion on the stream buffer is the multi-way stream buffer
which is actually a collection of stream buffers, each of which is working on its own
virtual miss reference stream. When a cache access occurs, the head of each of the
queues is examined in parallel for a match. If there is a miss reference that is not in any
of the queues, then the least recently used queue is flushed and assigned to the new
stream.
Furthermore, the stream buffer has difficulty when the memory accesses are not
sequential (unit stride); for example, it does not work as well when the data access
pattern is a strided cache-block access pattern. The standard stream buffer algorithm
proposed in [22] showed a reduction in data level one cache misses ranging from about
6% to 25% for the selected benchmarks. The instruction level one cache misses were
reduced by between 60% and 80%, due to the sequential nature of the
instruction stream.
Stride Prefetchers
Fu et al. [35] propose a stride prefetcher, which is another hardware-based prefetching
algorithm that is designed to detect and prefetch based on strides in the memory
accesses. It works by maintaining a Stride Prediction Table (SPT) in a cache-like
hardware structure. The SPT contains a series of entries that consist of an instruction
address (IA) field, a last memory address (MA) accessed field, and a valid (V) bit. As
instructions that reference memory are executed, they are stored in the SPT along with
the memory address they referenced. Figure 3.2 shows the design of the stride
prefetcher.
[Figure 3.2 block diagram; labels: instruction address (IA), last memory address (MA), comparator, subtracter, adder, SPT hit, prefetch address]
Figure 3.2: Stride prefetch algorithm hardware (taken from [35])
When an instruction references a memory address, the SPT is checked. If there is a
matching IA in the SPT with the V bit not set, the IA and MA pair will overwrite this
entry and set the V bit; this is an SPT miss. If the IA is in the SPT and the V bit is set,
then the stride will be calculated based on the current memory address and the MA in
the SPT. If the stride is not zero, then a prefetch can be issued. The SPT entry is then
updated with the new IA/MA pair.
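The SPT lookup and update described above can be sketched in a few lines. This is a hedged illustration, not the design from [35]: the direct-mapped indexing, the table size, and the names `StridePredictionTable` and `reference` are assumptions made for the example, and an empty slot stands in for a cleared V bit.

```python
class StridePredictionTable:
    """Toy direct-mapped SPT; a slot of None models a cleared valid (V) bit."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries    # each slot: (instruction address, last memory address)

    def reference(self, ia, ma):
        """Record a memory reference; return a prefetch address or None."""
        index = ia % self.entries
        slot = self.table[index]
        self.table[index] = (ia, ma)     # the entry always takes the new IA/MA pair
        if slot is None or slot[0] != ia:
            return None                  # SPT miss: nothing to compare against yet
        stride = ma - slot[1]
        return ma + stride if stride != 0 else None
```

For a load at instruction address 0x400 that touches addresses 1000 and then 1064, the second call returns 1128, the current address plus the observed stride of 64.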
Chen and Baer [23] propose a similar stride prefetch algorithm that relies on a Look-
ahead PC and branch prediction to calculate strides and issue prefetches. This
algorithm addresses the issue of timeliness in prefetching, whereas the algorithm
proposed in [35] does not. The look-ahead PC is designed to look ahead at possible
instruction references that might be issued a number of cycles in the future equal to
the memory latency. If the look-ahead PC determines it can make a prefetch, the idea
is that by the time the real PC reaches this instruction, the memory latency will have
already passed and the data will be immediately available in the cache for the processor
to access.
The stride prefetcher is able to prefetch memory accesses in patterns that the
traditional stream buffer cannot, due to the stream buffer's limitations on its fetching
patterns. A major implementation issue is that the stride prefetcher requires access to
the instruction stream. A stride prefetcher thus has a high impact on processor design,
since it must be built into the core processor chip.
Correlation based prefetching: the Markov Predictor
Correlation based prefetching [1, 2, 4, 5, 24] is a different style of prefetching
compared to compiler based prefetching, stream buffers, and stride prefetchers.
Correlation based prefetching attempts to find a correlation between past miss
references and future miss references, rather than just prefetching several blocks of
sequential cache lines whenever there is a miss reference, such as in the stream buffer
case. One correlation-based prefetcher is Joseph and Grunwald's [4] Markov Predictor.
The Markov predictor builds and utilizes Markov prediction sequences based on the
history of misses that were previously encountered in an execution. By inferring that a
given pattern of cache misses will probably repeat in the future with some probability,
these patterns can be stored in a cache-like structure and can be used to predict future
memory access patterns. Figure 3.3 shows the portion of the Markov prefetcher design
that determines how to issue prefetches.
[Figure 3.3 block diagram; labels: current miss reference address, next address prediction registers (miss address entries 1 to N, each with 1st to 4th predictions), prefetch request queue, CPU address, MUX, L2 request, to CPU]
Figure 3.3: Hardware implementation of the Markov Predictor (taken from [4])
The Markov Predictor works by watching miss reference patterns from a program
execution and storing the miss references that are most likely to be issued after a given
miss reference. The implementation does this by using a table of entries, each of which
contains a miss address, and four possible miss addresses that may occur next, based
on previous execution history. In one of the scheme variations, each miss address also
has an associated least recently used (LRU) identifier that identifies which of the four
possible prefetch addresses to issue a prefetch on. As code executes, this table is
continually updated and used to issue prefetches. When a miss reference is
encountered, prefetches are issued for the possible future miss references and placed in
a 16-entry FIFO queue. These prefetches are then issued according to bandwidth
availability. Demand-fetch cache misses always have the highest priority.
When a prefetch is satisfied, it is placed in a fully associative stream-buffer like
structure instead of the level one cache (not shown in Figure 3.3). This structure is
then examined in parallel with the level one cache to determine if there is a cache hit.
If an entry from this structure is used, it is transferred to the real level one cache. In
contrast to the traditional stream buffer, any of the entries may be used, and the
corresponding later entries will be shifted up one slot if an earlier entry is removed.
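The table update and prefetch queue described above can be modelled roughly as follows. This is a simplified sketch rather than the implementation in [4]: the table is an unbounded dictionary, the predictions of each entry are kept in most-recently-seen order as a stand-in for the LRU variation, and a full queue drops its oldest request, which is an assumption since the overflow policy is not described here. The class and method names are hypothetical.

```python
from collections import deque


class MarkovPredictor:
    """Toy Markov predictor: maps a miss address to the misses that followed it."""

    def __init__(self, predictions_per_entry=4, queue_depth=16):
        self.width = predictions_per_entry
        self.table = {}                          # miss address -> next misses, most recent first
        self.queue = deque(maxlen=queue_depth)   # pending prefetch requests (FIFO)
        self.prev_miss = None

    def miss(self, addr):
        """Learn from a miss, then queue prefetches for its likely successors."""
        if self.prev_miss is not None:
            successors = self.table.setdefault(self.prev_miss, [])
            if addr in successors:
                successors.remove(addr)
            successors.insert(0, addr)
            del successors[self.width:]          # drop the least recently seen successor
        self.prev_miss = addr
        predicted = list(self.table.get(addr, []))
        self.queue.extend(predicted)
        return predicted
```

After the miss sequence 1, 2, 1, 3, 1, the final miss on address 1 yields the predictions 3 and 2, because both addresses have previously followed a miss on 1.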
This approach was shown to work well in trace-driven simulation when combined with
stride prediction and stream buffers [4]. The problem with this algorithm is that the
correlation tables are created from past executions of programs, so it is future
executions of the same programs that see the most benefit. It is proposed in [4]
that the predictor would be used across all executions on a processor in order to
populate the table. However, no mention is made of the effect of multiple programs
running in different memory spaces and polluting the tables. The table may not reflect
the proper memory transitions by the time a given program is executed, profiled into
memory transitions in the tables, and then executed again. The table has only a finite
amount of memory allocated to it, and a very large table might be required for the
scheme to work in the proposed manner. The size of the table is limited by the amount
of memory available to the prefetcher implementation, yet there are no simulations that
deal with a multi-program environment like the one a typical microprocessor would be
subjected to, only single execution traces.
Many of the hardware prefetch algorithms discussed make use of a hardware unit that
stores the prefetched values, which is then checked in parallel with the level one cache.
This reduces pollution of the cache by useless prefetched values, which would otherwise
replace useful cache blocks. The algorithm proposed in chapter 4 does not use
such a structure, but that may be a possible enhancement to the proposed architecture.
The Kalman Filter
In 1960, R. E. Kalman [6] proposed his Kalman filter as a new recursive optimal linear
estimator. The Kalman filter is not as much of a 'filter' as it is a data processing
algorithm that recursively operates on discrete data points without the need to store all
past data points. Take for instance a physical process, which needs to be estimated.
First, a mathematical model of the process is derived, and represented in a state matrix
transition form. However, since this model is not precisely accurate with respect to the
actual process, there are inherent errors in the mathematical calculations. The Kalman
filter maintains a state model of various state parameters, which are generally the state
parameters for this process that are not explicitly known. The Kalman filter uses input
data (measurements of the process) that is assumed to have a Gaussian error
probability density function, incorporates past measurements with current
measurements and, in accordance with a specific degree of reliability, estimates the new
values of the state variables based on the measurements. Basically, a Kalman filter can
produce the best estimate for a set of state variables based on the measurements and
the knowledge of how accurate the measurements and the process state transitions are.
With respect to the physical process and the measurement of that physical process, the
Kalman filter can produce an optimal estimation of the values of those state
parameters.
In [7] and [8], a more simplified introduction to the Kalman filter is given, and the
most relevant points are summarized here. The Kalman filter is defined by five
matrices: the $n \times n$ system matrix $A$ and optional $n \times 1$ input matrix $B$, the $m \times n$
observation matrix $H$, the $n \times n$ system covariance matrix $Q$, and the $m \times m$
observation covariance matrix $R$. These matrices represent the state transition
parameters, the observable variables, the uncertainty in the state transitions, and the
uncertainty in the observation, respectively.
The Kalman filter attempts to estimate the linear stochastic difference equation for
$x \in \Re^n$:
$x_k = A x_{k-1} + B u_{k-1} + w_{k-1}$ (3.1)
with a process measurement $z \in \Re^m$ determined by:
$z_k = H x_k + v_k$ (3.2)
The variables $w_k$ and $v_k$ are random variables representing the process and
measurement uncertainty, respectively. These are random variables that are assumed to
be independent, white, and have normal probability distributions governed by:
$p(w) \sim N(0, Q)$ (3.3)
$p(v) \sim N(0, R)$ (3.4)
Two other variables pertaining to the algorithm are the $n \times n$ estimation error
covariance matrix $P_k$ and the $n \times m$ Kalman gain matrix $K_k$. The derivations of these
parameters have been omitted for brevity, but they represent the covariance of the
error between the estimation of the current state and the actual current state, and the
blending factor between the measurement and the previous prediction that minimizes
the covariance of the error in the current measurement, respectively. These matrices
govern the balance between how much a measurement is trusted vs. how much a
prediction is trusted during the Kalman filter updates.
The discrete Kalman filter algorithm consists of an ongoing process between two
phases: the time update, or prediction stage, and the measurement update, or
correction stage. In the time update stage, the previous iteration's state variables are
used to predict what the current state variables will be. This is the a priori estimate, and
is denoted with a superscript minus sign, e.g. $\hat{x}_k^-$. Note the hat symbol (^), which
denotes that this is a predicted value. In the time update step, the estimation error
covariance $P_k$ is also updated. The time update equations are:
$\hat{x}_k^- = A \hat{x}_{k-1} + B u_{k-1}$ (3.5)
$P_k^- = A P_{k-1} A^T + Q$ (3.6)
These equations predict from the previous iteration's time step k-1 to the current time
step k. After the time update equations have been processed, these a priori estimate
values are subjected to the measurement update equations, which incorporate the
current measurement value to determine the new estimated state. The measurement
update equations are:
$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$ (3.7)
$\hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-)$ (3.8)
$P_k = (I - K_k H) P_k^-$ (3.9)
The measurement equations incorporate the observation data $z_k$ to produce the a
posteriori estimate. First, the new Kalman gain $K_k$ is calculated using the observation
covariance matrix $R$ and the a priori estimate for the estimation error covariance matrix
$P_k^-$ in (3.7). This new gain represents how much the new observation is trusted in
comparison to the estimated value. Measurement update equation (3.8) computes the a
posteriori estimate of the state variables and uses this new gain matrix to integrate a
portion of the new measurement with the previous estimation value. Finally, the a
posteriori estimate of the estimation error covariance matrix is calculated from equation
(3.9).
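To make the role of the gain concrete, it can help to write equations (3.5) through (3.9) for the simplest possible case. The following worked specialization assumes a scalar system with $n = m = 1$, $A = 1$, $H = 1$ and no control input ($B = 0$); these values are chosen purely for illustration and are not the model used later in this work.

```latex
% Scalar special case: n = m = 1, A = 1, H = 1, B = 0 (illustrative assumption)
\begin{align*}
\hat{x}_k^- &= \hat{x}_{k-1}                          && \text{from (3.5)} \\
P_k^-       &= P_{k-1} + Q                            && \text{from (3.6)} \\
K_k         &= \frac{P_k^-}{P_k^- + R}                && \text{from (3.7)} \\
\hat{x}_k   &= \hat{x}_k^- + K_k\,(z_k - \hat{x}_k^-) && \text{from (3.8)} \\
P_k         &= (1 - K_k)\,P_k^-                       && \text{from (3.9)}
\end{align*}
% As R -> 0 the gain K_k -> 1 and the measurement z_k is trusted fully;
% as P_k^- -> 0 the gain K_k -> 0 and the prediction is trusted fully.
```

In this form the gain is visibly a ratio between the prediction uncertainty and the total uncertainty, which is the balance between measurement and prediction described above.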
The time update and measurement update steps continue to alternate, producing new
predictions and continually correcting them with measurement information.
Because of this predictive property that the Kalman filter has, it is a feasible choice for
the prefetching algorithm discussed later in this work. To use the Kalman filter
algorithm for predicting, after the measurement update equations have been applied to
integrate a new data point into the estimated model, the time update equations can be
applied to the state variables to attain a new a priori estimate for the next state. This
estimation can then be used as a prediction for the next prefetch address.
The system covariance matrix $Q$ and the observation covariance matrix $R$ are treated
as static matrices that do not change over time in this description, but in fact these
matrices are not always stable in real-life processes. For the simulations in this
work, however, these matrices will be kept constant because the estimation error
covariance matrix $P_k$ and the Kalman gain matrix $K_k$ will stabilize quickly and remain
constant under these conditions [7].
Since the Kalman filter is a very general algorithm, it was chosen for use as the digital
signal-processing element in this work. A simple modification to the Kalman filter's
internal parameters can make the filter predict in different manners, so an algorithm
that uses these filters as a basic logic processing block can potentially have a large
number of opportunities to use the same filter implementation and mechanism, and be
able to produce many different types of results by simply changing the internal
parameters of the filters. Another highlight of the Kalman filter is that it does not need
to store large amounts of data, since it is a recursive algorithm. This eliminates the
need for storing large data tables, which are present in many existing correlation based
prefetch algorithms today. In chapter 4, an algorithm is presented that treats the
Kalman filter as a black box prediction mechanism, and uses its predictive capabilities
to perform prefetching.
Chapter 4
Algorithm Design
The algorithm presented in this work applies the principles of digital signal processing,
in particular digital signal processing using the Kalman filter, to analyze miss reference
streams and produce predictions that are used as prefetch addresses. The algorithm
encapsulates a given number of filters, each of which has its own internal Kalman
filter parameters, such as its state model and covariance. Before discussing the general
algorithm design, the internal Kalman filter design chosen for this work is covered.
The Kalman filter was chosen as the internal filter for this algorithm because of its
flexibility and applicability in other areas of engineering and science [7]. The Kalman
filter algorithm can also be implemented in a pipelined fashion [26], so one must only
wait for the initial latency for the first calculation, and then a new calculation can be
produced each CPU cycle afterward. In fact, this prefetching algorithm could use
another type of filter or predictor as its internal prediction device. The filter is treated
as a black box with respect to the general algorithm design, and could be changed to a
different filter, or could even be a hybrid of Kalman filters and other types of filters
working in parallel.
For simplicity, all of the filters used in the algorithm in this work will be Kalman filters
that model a simple particle moving with constant velocity in a single dimension.
Note that this model is not finely tuned for optimal performance; it simply serves as a
filter that is able to predict constant stride patterns. The exact parameter values were
chosen through intuition and experimentation. It was
found that the numbers selected produced a filter that was able to lock onto the
desired access pattern quickly and efficiently, and also had the property of correcting
itself quickly in the event of a disruption in the pattern. The specific Kalman filter
model used is:
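System Matrix $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$
Observation Matrix $H = \begin{bmatrix} 1 & 0 \end{bmatrix}$
System Covariance Matrix $Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
Observation Covariance Matrix $R = \begin{bmatrix} 0.001 \end{bmatrix}$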
This model's system matrix resembles a particle with constant velocity. The two state
parameters are the position and the velocity of the particle, and with each time step it
is predicted that the position state variable will change by an amount given by the
velocity state variable. The observable parameter is the position of the particle, which
in this algorithm is analogous to the memory address of the data point. The velocity of
the particle is analogous to the stride in a typical stride prefetch scheme.
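The constant-velocity model above is small enough to write out without a matrix library. The following is a minimal sketch, assuming an identity initial error covariance and the two-point initialization of position and velocity that this chapter describes later; the class and method names are hypothetical and the code is not the simulator used in this work.

```python
class StrideKalmanFilter:
    """Two-state (position, velocity) Kalman filter with A=[[1,1],[0,1]], H=[1,0]."""

    Q = 1.0      # diagonal entries of the system covariance matrix (identity)
    R = 0.001    # observation covariance

    def __init__(self, first, second):
        # Position starts at the latest point, velocity at the first observed stride.
        self.pos = float(second)
        self.vel = float(second - first)
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # assumed initial error covariance

    def predict(self):
        """Time update (3.5)-(3.6): return the a priori position, velocity, covariance."""
        (p00, p01), (p10, p11) = self.P
        prior_p = [[p00 + p01 + p10 + p11 + self.Q, p01 + p11],
                   [p10 + p11, p11 + self.Q]]
        return self.pos + self.vel, self.vel, prior_p

    def update(self, z):
        """Measurement update (3.7)-(3.9) with address z; return the next predicted address."""
        pos, vel, p = self.predict()
        s = p[0][0] + self.R                  # H P^- H^T + R is a scalar here
        k0, k1 = p[0][0] / s, p[1][0] / s     # Kalman gain for position and velocity
        residual = z - pos
        self.pos = pos + k0 * residual
        self.vel = vel + k1 * residual
        self.P = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
                  [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        return self.predict()[0]
```

For example, a filter created from the misses 1000 and 1064 and then given 1128 predicts 1192 as the next address. Because $R$ is small relative to $Q$, the gain on the position stays close to one, so the filter follows each measurement closely and corrects its velocity within a few points when the stride changes.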
This filter model is designed to allow the internal filters to be able to recognize and
predict stride patterns in the cache accesses. Other filter parameters could be
substituted for these in order to make the filters be able to recognize more complex
miss reference patterns, but as mentioned earlier, the algorithm discussed in this work
will use these filter parameters and focus on predicting the stride-like patterns that
appear in the miss reference streams. A miss reference characterization of various
benchmark programs shows that stride patterns and next-block patterns (the next
cache miss occurs in the next cache block, or unit stride) account for a considerable
portion of the cache miss patterns in many benchmark programs [27]. In addition, the
success of stride prefetchers in general [28] contributes to the decision to use this
model for the internal Kalman filter for this algorithm.
The algorithm layout consists of three major logical blocks. The first is the control
logic block, which is responsible for watching the cache access patterns and
demultiplexing these cache accesses into data point inputs for the various filters. The
second logical block is the filter array, which receives input data points from the
control logic block. The third logical block is the prefetch logic block, which watches
the predictions made by the Kalman filters, and determines when to issue prefetches
based on whether or not a given Kalman filter is accurately predicting its virtual
reference stream.
A block diagram showing the logical flow of the algorithm is shown in figure 4.1.
[Figure 4.1 block diagram; labels: cache access stream, control logic, filters 1 to n, prefetch logic, prefetches to insert into cache miss stream, filter prediction feedback]
Figure 4.1: General algorithm layout using n internal filters
As shown in figure 4.1, there are a variable number of filters in the filter array, which
introduces the first general algorithm parameter, the number of Kalman filters
(NKF) to use. Experimentation shows that the marginal return of each additional
filter used in the algorithm diminishes in terms of miss reference prevention (see
chapter 5). All of these Kalman filters that are internal to the algorithm are in one of
two states at any given time, available or waiting. All of the internal Kalman filters
begin in the available pool, and as they begin to listen to miss reference streams, they
enter the waiting stage where they will wait for miss reference addresses that are given
to them by the algorithm controller. As these points are received, they will process
them using their Kalman filter algorithms, and produce a prediction point that may or
may not be used as a prefetch, based on the success of past predictions.
The algorithm analyzes the miss reference stream generated by the misses from the
data level-one cache. The algorithm then in turn determines which, if any, internal
Kalman filter should receive this input point. The algorithm must keep track of the last
data point that each internal Kalman filter has processed, along with the prediction it
made for the next data point. By dividing the aggregate miss reference stream into
these separate streams intended for each internal Kalman filter instance, the concept of
virtual streams is introduced. Each of these virtual streams is processed by a single
Kalman filter instance. Simple thresholding is used to determine whether or not a
point belongs to a particular internal Kalman filter. Thresholding is done directly on
the miss address itself. This sort of thresholding limits the effectiveness of
the algorithm by allowing it to lock only onto streams that have a fairly high degree of
spatial locality in the address space. However, having too large a threshold can
introduce confusion of two different virtual streams that are close to each other in
terms of address space, causing the Kalman filters to predict incorrect values. This
introduces the second general algorithm parameter, the virtual stream thresholding
tolerance (VSTT), which is the number of bytes that an address point must be
within, plus or minus, with respect to the filter's previous point.
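The effect of the VSTT on virtual stream separation can be seen with a very small model. The sketch below is illustrative only: it ignores the limit on the number of filters and the recycling rules, treats the threshold as inclusive (an assumption), and uses a hypothetical function name.

```python
def split_virtual_streams(misses, vstt):
    """Group miss addresses into virtual streams by simple thresholding."""
    streams = []                                  # each stream is a list of addresses
    for addr in misses:
        for stream in streams:
            if abs(addr - stream[-1]) <= vstt:    # within +/- VSTT of the last point
                stream.append(addr)
                break
        else:
            streams.append([addr])                # no stream claimed the point
    return streams
```

With the interleaved misses 0x1000, 0x8000, 0x1040, 0x8100, 0x1080, 0x8200, a VSTT of 512 bytes separates the two strided streams, whereas a VSTT of 64 KB merges all six points into a single stream whose apparent stride alternates wildly. This is the confusion between nearby virtual streams that an overly large threshold can cause.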
Each internal Kalman filter of the algorithm will continue to listen for miss references
that fall within its threshold values until it finds one. It is possible that an internal
Kalman filter was allocated to listen for a particular virtual stream that is not in fact a
traceable miss reference virtual stream, but just a statistical outlier. In this case,
it would be possible to have several internal Kalman filters that are listening for points
they will never receive, and they are just being wasted. To combat this anomaly, a third
general algorithm parameter is introduced, the recycling threshold (RT). The RT
specifies a number of CPU cycles to wait between virtual stream points before an
internal Kalman filter is disposed of and given back to the pool of available internal
Kalman filters. Specifying a number that is too low for this parameter will result in
internal Kalman filters that are following and predicting a valid virtual stream to be
recycled while they are performing useful work; a value too large will result in internal
Kalman filters waiting on invalid streams when there are valid streams that are being
rejected because there are no available Kalman filters in the pool.
In addition to the RT, a second parameter is provided that assists in cleaning up
internal Kalman filters that have been orphaned waiting on invalid streams. This
parameter is the recycling delta multiplier (RDM). This parameter works by
remembering the time difference (in CPU cycles) between each data point that each
internal Kalman filter received. Each time a miss reference is received by a particular
internal Kalman filter, the time delta in CPU cycles between the current cycle and the
cycle in which the previous reference was received is calculated. This time delta is
multiplied by the RDM and the result then acts similarly to the RT. At that point, if
this internal Kalman filter does not receive another miss reference before this time
span expires, then it will be disposed of and returned to the available pool. This
parameter is designed to clean up Kalman filters with greater speed when they are
locked onto a stream that is producing miss references at a particular rate, and then all
of a sudden stops producing. It allows a Kalman filter that
has recently become invalid to be cleaned up earlier than those that must wait for the
RT to expire. Note that if a filter has just been added to the waiting pool from the
available pool and only has a single miss reference in its history, it is not subject to any
RDM calculations.
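The interaction of the RT and the RDM can be sketched as a per-filter timer. This is a hedged illustration: it assumes that both timeouts are in force at once and that the earlier one recycles the filter, which is one reading of the description above; the class and method names are hypothetical.

```python
class RecycleTimer:
    """Tracks when a waiting filter should be returned to the available pool."""

    def __init__(self, rt, rdm):
        self.rt = rt              # recycling threshold, in CPU cycles
        self.rdm = rdm            # recycling delta multiplier
        self.last_cycle = None    # cycle of the most recent input point
        self.delta = None         # cycles between the last two input points

    def point_received(self, cycle):
        if self.last_cycle is not None:
            self.delta = cycle - self.last_cycle
        self.last_cycle = cycle

    def deadline(self):
        """Cycle after which the filter is recycled if no new point arrives."""
        limit = self.rt
        if self.delta is not None:            # RDM applies only after two points
            limit = min(limit, self.rdm * self.delta)
        return self.last_cycle + limit

    def expired(self, cycle):
        return cycle > self.deadline()
```

With an RT of 10000 cycles and an RDM of 4, a filter that received points at cycles 100 and 150 is recycled after cycle 350 rather than after cycle 10150, since its stream had been producing a miss every 50 cycles.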
The algorithm control unit is the logical device that listens to the D-L1 cache miss
references and determines what to do with incoming miss reference addresses. When a
miss reference is received (a data point), the list of waiting internal Kalman filters is
examined, and a comparison is made to determine whether or not this point falls into
any of the waiting Kalman filters' VSTT. The first filter that is found for which the
thresholding check passes receives this miss reference as its next data point. At this
point, the Kalman filter will operate its algorithm on the data point based on its
internal parameters and produce a prediction. If none of the filters in the waiting list
pass the thresholding check, the list of available filters is examined. If there are any
entries in the available list, then the first filter in the list will be selected and moved to
the waiting pool, and this filter will receive the miss reference as its first data point.
The examination of the waiting and available lists occurs in
parallel, such that all filters are examined in the same cycle. Priority logic then
determines which filter to select in the event that more than one filter meets the
criteria, for instance the case when a waiting filter can accept this point and there is
also an available filter that can accept this point. In this event, the waiting filter will be
issued the point.
Finally, if none of the filters in the waiting pool match, and there are no filters in the
available pool, then the miss reference is discarded. This process is similar to the way
in which a translation lookaside buffer (TLB) is accessed to determine which page of
memory a memory address resides in. All of the entries are queried in parallel,
except that instead of searching for exact matches, each entry performs two
comparison operations to determine if the miss reference is within the range dictated by
the filter's VSTT. A flowchart detailing this logic is shown in figure 4.2.
[Figure 4.2 flowchart; boxes: Receive cache miss reference point X; Check list of waiting filters; Check list of available filters; Recycle filter Y, if necessary, and assign point X to filter Y; Move filter Y to waiting list and assign point X to filter Y; Discard point X]
Figure 4.2: Flowchart of control unit's logic when receiving a miss reference
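The miss-handling decisions of the control unit can be modelled sequentially, even though the hardware examines all filters in parallel. The sketch below is an assumption-laden illustration: filters are represented only by an integer id and their last input address, the threshold is inclusive, and the names `ControlUnit`, `on_miss` and `recycle` are hypothetical.

```python
class ControlUnit:
    """Toy model of how a miss reference is routed to a filter."""

    def __init__(self, nkf, vstt):
        self.vstt = vstt
        self.available = list(range(nkf))    # ids of filters not tracking a stream
        self.waiting = {}                    # filter id -> last input address

    def on_miss(self, addr):
        """Return the id of the filter that receives `addr`, or None if discarded."""
        # Waiting filters have priority over available ones.
        for fid, last in self.waiting.items():
            if abs(addr - last) <= self.vstt:
                self.waiting[fid] = addr
                return fid
        if self.available:
            fid = self.available.pop(0)      # first available filter starts a new stream
            self.waiting[fid] = addr
            return fid
        return None                          # no filter can take the point

    def recycle(self, fid):
        """Return a waiting filter to the available pool (RT or RDM expiry)."""
        del self.waiting[fid]
        self.available.append(fid)
```

With two filters and a VSTT of 256 bytes, misses at 0x1000 and 0x9000 allocate both filters, a miss at 0x1040 goes to the first filter, and a miss at 0x5000 is discarded until one of the filters is recycled.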
The algorithm prefetch unit listens to predictions made by the Kalman filters. For each
internal Kalman filter, the prefetch unit remembers the last prediction that each
Kalman filter made. It also listens to the D-L1 cache hits as well as the cache misses to
check for instances when the Kalman filter made a correct prediction. When a cache
hit is encountered, the prefetch unit compares this value to all of the waiting Kalman
filters' last prediction value. If the cache hit is contained within the cache block that
one of the Kalman filters had previously predicted, then this is noted as a successful
prediction, and this cache hit value is given to that Kalman filter for its next data point.
In the optimal case, when a Kalman filter is in the process of correctly predicting a
given virtual stream, it should be doing it by receiving a series of cache hits that it had
previously predicted, issued prefetches on, and brought the values into cache before
they were needed. In addition, in the case that the miss reference coincides with an
address that was prefetched due to a given Kalman filter prediction but is not yet in
cache, then this is also considered a successful prediction, and this miss reference is
forwarded to the Kalman filter as its next data point. Not all of the memory latency is
hidden in this case, but some of the latency has been hidden.
In order for the prefetcher to be able to recognize cache hits as successful predictions,
the prefetch unit must forward these prediction addresses to the control logic unit,
which is examining the cache reference stream. When a prediction is made, then the
control unit will receive a message from the prefetch unit to begin listening for cache
hits on a certain cache block for a certain filter. As cache hits are received, they are
compared with this list of cache blocks that were predicted. If there is a match, then
this point is forwarded to the appropriate filter. However if the control unit determines
that a miss reference belongs to a particular filter that is NOT in the cache block that
was previously predicted by this filter, then this prediction value will be cleared from
the control unit and will be replaced by the new prediction value generated by the filter
after the filter's computation latency. A flowchart detailing the logic for when a cache
hit is encountered is shown in figure 4.3.
[Figure 4.3 flowchart; boxes: Receive cache hit reference point X; Check list of pending prefetches; on a match, assign point X to filter Y]
Figure 4.3: Flowchart of control unit's logic when receiving a cache hit
To determine when it is an appropriate time to issue a prefetch, the Kalman filter must
prove to the prefetch unit that it can successfully predict. The algorithm does not issue
any prefetches from a given Kalman filter until the Kalman filter has had at least one
successful prediction. A normal sequence of events that would happen to a given
Kalman filter in a successful prediction cycle would be:
1. A miss reference is received, and Kalman filter A is assigned to this virtual stream.
This miss reference is given to the filter as its first data point.
2. A second miss reference is received, and it is within the VSTT of Kalman filter A.
This reference is forwarded to this Kalman filter as its second data point.
3. The difference between the first point and the second point is calculated, and that
initializes the velocity parameter of the Kalman filter. The position parameter is
initialized to the current input point. The Kalman filter now has a fully initialized
set of state variables.
4. A third miss reference is received, and is within the VSTT of Kalman filter A. This
point is forwarded to the Kalman filter, and the filter operates its algorithm on the
point, and updates its internal state variables.
5. The Kalman filter produces a prediction for the next point it will receive by
processing its time-update equations with its internal state variables to predict the
state variables for the next iteration. The position data point is issued as a
prediction.
6. A fourth miss reference is received by Kalman filter A. This miss reference is
contained in the same cache block as the prediction that it had previously issued.
The Kalman filter algorithm processes the point, and another prediction is made.
7. Since the previous prediction had been a successful prediction, the prefetch logic
will issue a prefetch against the filter's next prediction in confidence.
8. A cache hit is detected that falls within the cache block that was prefetched by
Kalman filter A. This point is given to the Kalman filter for processing, and its
prediction is again issued as a prefetch.
9. This continues until either the Kalman filter receives a miss reference that was
NOT in the cache block previously predicted, or it is recycled due to a timeout by
either the RT or RDM.
In the event that the filter receives an input point that was not from a successful
prediction, then the prediction logic resets and the prefetching will stop until the filter
is able to correctly predict again. This works to minimize false prefetches that would
waste memory bandwidth and pollute the cache, by only prefetching when the filter is
in a successful prediction cycle.
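The sequence of events listed above, including the reset on an unsuccessful prediction, can be sketched as a small state machine. To keep the example short, the Kalman filter is replaced by a stand-in that predicts the last address plus the last stride, which is an assumption made only for illustration; the 32-byte cache block size and the class name are also hypothetical.

```python
BLOCK_SIZE = 32    # bytes per cache block (assumed for this example)


class GatedFilter:
    """Issues a prefetch only when the previous prediction was successful."""

    def __init__(self):
        self.count = 0                 # number of input points seen
        self.prev = None               # previous input address
        self.last_prediction = None
        self.confident = False         # models the P bit of the prefetch logic

    def feed(self, addr):
        """Process one input point; return a prefetch address or None."""
        # A prediction is successful if the new point lies in the predicted cache block.
        self.confident = (self.last_prediction is not None and
                          addr // BLOCK_SIZE == self.last_prediction // BLOCK_SIZE)
        stride = None if self.prev is None else addr - self.prev
        self.prev = addr
        self.count += 1
        if self.count < 3:
            return None                # first two points only initialize the state
        self.last_prediction = addr + stride    # stand-in for the Kalman prediction
        return self.last_prediction if self.confident else None
```

Feeding the points 1000, 1064, 1128, 1192 produces the first prefetch (for address 1256) only on the fourth point, after the prediction made on the third point has been confirmed. A point outside the predicted block, such as 2000, clears the confidence bit and suppresses prefetching until a later prediction is confirmed.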
To implement the control logic unit and the prefetch unit, very little hardware is
needed per filter, even in a very simple implementation. The control logic unit requires
two registers to hold the last input address and the last prediction address, a bit to
indicate the state of the filter (available or waiting, noted as W in figure 4.4), and two
comparators to determine if an address is within the VSTT of the filter. If the VSTT is
a power of two, a simple bit shift operation can be used to calculate the upper and
lower bound for the VSTT comparison operations. The filter-recycling infrastructure
requires a counter to determine how many CPU cycles have elapsed since a waiting
filter has received its last input point, a register to hold the cycle count at which the
filter will be recycled, a multiplier to determine the next RDM product (if the RDM is
a power of two, a simple bit shift operation will suffice), and a comparator to detect
when the filter has been in the waiting state for enough cycles to be recycled. The
prefetch logic unit simply requires a single bit that represents whether or not the
previous prediction was correct (noted as P in figure 4.4). This bit is set by the control
logic when an input value matches the value in the last prediction register exactly.
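The power-of-two simplification mentioned above can be illustrated in software terms. In this sketch the VSTT and the RDM are each given as a shift amount (the base-two logarithm of the parameter), which is an assumed encoding; the function names are hypothetical, and the two comparisons in `accepts` correspond to the two comparators per filter.

```python
def vstt_bounds(last_input, vstt_shift):
    """Lower and upper address bounds for a VSTT of 2**vstt_shift bytes."""
    span = 1 << vstt_shift               # the shift replaces a stored constant
    return last_input - span, last_input + span


def accepts(address, last_input, vstt_shift):
    """Two comparisons decide whether the address falls within the VSTT."""
    lower, upper = vstt_bounds(last_input, vstt_shift)
    return lower <= address <= upper


def rdm_timeout(delta_cycles, rdm_shift):
    """RDM product for an RDM of 2**rdm_shift, computed with a shift."""
    return delta_cycles << rdm_shift
```

For a last input address of 0x1000 and a VSTT of 256 bytes (a shift of 8), the accepted range is 0x0F00 to 0x1100, and a time delta of 50 cycles with an RDM of 4 (a shift of 2) gives a timeout of 200 cycles.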
Figure 4.4 shows a simplified implementation layout diagram showing the control unit
logic to determine when an input point should be used, and the prefetch logic to
determine when to invoke a prefetch. Not all of the connections are shown, only the
major hardware components and the main data flow paths.
[Figure 4.4 implementation diagram; labels: cache reference address, W bit, Last Input register, Shift by VSTT (two instances), Last Prediction register, Take input point, To prefetch logic]
Figure 4.4: Simplified implementation diagram of control logic and prefetch logic
In the next chapter, a trace-based simulation is performed to evaluate the algorithm in
terms of accuracy and effectiveness.
Chapter 5
Algorithm Evaluation
This chapter presents an evaluation of the potential performance of the algorithm.
Prior to the research performed in this work, a set of cache miss references was
generated for a variety of programs by other researchers. In addition, a GUI tool
was created by the author of [13] that allowed Kalman filters
with various parameters to be evaluated against the input streams. However, because
the traces are very large, this tool was too inefficient to simulate the filter
on them. In response to this, two different evaluation environments were
created.
The traces were produced by a modified SimpleScalar [15] CPU
simulation suite running as sim-outorder that was enhanced to output the memory
reference of a cache miss along with the CPU cycle in which it occurred for various
benchmark programs. The traces contained information on data cache hits and
misses, instruction cache hits and misses, and the unified level two cache hits and
misses. Since this work focuses on the level-one data cache misses,
the traces were filtered to only include these miss references. Also, cache
reads and writes are treated the same for the evaluation phase, since either could
result in a cache miss that could have been prefetched. There were two sets of
streams available - one was the set of cache miss streams from the benchmark
programs compiled for the Alpha 21264 architecture with compiler optimization
level four (the most aggressive optimization level available for Alpha 21264), and
the other was the set of cache miss streams for the PISA architecture with compiler
optimization level three (the most aggressive optimization level available for PISA).
Each of these architectures was evaluated so that they could be compared. The
main difference between these two architectures is that the Alpha 21264
architecture includes a PREFETCH compiler instruction, whereas the PISA
architecture does not.
The first evaluation environment was implemented in Matlab [29] because of the
ease of design and algorithm modification, as well as the intrinsic availability of
graphing tools. The Matlab environment still had problems producing results for
very large traces due to memory constraints, so a second evaluation environment in
C++ was created. This environment was able to produce figures on what
percentage of prefetches would have been correct out of the total number of
prefetches, and what percentage of the total data level one cache misses could have
been prefetched. The Matlab environment was used in the algorithm design and
verification phases, and the C++ environment proved most useful in the
evaluation of the algorithm once it had been developed in the Matlab environment.
The algorithm was evaluated against a broad set of benchmark programs to get a
wide range of performance measurements for different types of applications. The
programs that were used to evaluate the algorithm are given in table 5.1.
Program           Description                              Benchmark Suite
Adpcm_encode      Speech compression                       MediaBench
Adpcm_decode      Speech decompression                     MediaBench
Applu             Parabolic/Elliptical partial diff. eqs.  SPECFP2000
Apsi              Meteorology: pollutant distribution      SPECFP2000
Art               Image Recognition / Neural Networks      SPECFP2000
Bh                Gravitational system                     Olden
Bisort            Sorting                                  Olden
Bzip2             Compression                              SPECINT2000
Em3d              Electromagnetic wave modeler             Olden
Epic_encode       Image compression                        MediaBench
Epic_decode       Image decompression                      MediaBench
Equake            Seismic wave simulator                   SPECFP2000
G721_encode       Speech compression                       MediaBench
G721_decode       Speech decompression                     MediaBench
Gcc               C Compiler                               SPECINT2000
Ghostscript       Postscript interpreter                   MediaBench
Gsm_encode        Speech compression                       MediaBench
Gsm_decode        Speech decompression                     MediaBench
Gzip              Compression                              SPECINT2000
Health            Columbia health system simulation        Olden
Jpeg_compress     Image compression                        MediaBench
Jpeg_decompress   Image decompression                      MediaBench
Mcf               Combinatorial Optimization               SPECINT2000
Mesa_osdemo       3D Graphics                              MediaBench
Mesa_mipmap       3D Graphics                              MediaBench
Mesa_texgen       3D Graphics                              MediaBench
Mesa              3D Graphics                              SPECFP2000
Mpeg2_encode      Video compression                        MediaBench
Mpeg2_decode      Video decompression                      MediaBench
Mgrid             Multi-grid solver                        SPECFP2000
Mst               Minimum spanning tree                    Olden
Parser            Word processing                          SPECINT2000
Perimeter         Calculate perimeter of raster images     Olden
Power             Power system optimization                Olden
Swim              Shallow wave modeling                    SPECFP2000
Treeadd           Add values in a tree                     Olden
Tsp               Traveling-salesman problem               Olden
Vpr               FPGA Placement & routing                 SPECINT2000
Vortex            Object Oriented Database                 SPECINT2000
Wupwise           Physics/Quantum Chromodynamics           SPECFP2000
Table 5.1: Benchmarks used
Since the evaluation phase of the algorithm testing only makes use of pre-generated
cache miss reference streams, there is no way to determine the effect that the
algorithm has on execution time. However, the metrics that can be measured in this
type of test are the accuracy of the algorithm and the overall effect of the algorithm on
reducing cache misses. For programs that are bound by the cache, this should lead to a
reduction in execution time. There are, however, some programs that are
computationally bound or bound by some factor other than the cache, so even a
significant reduction in cache misses will not produce a significant reduction in
execution time. For the purposes of this evaluation, the metrics measured are the
percentage of correct prefetches with respect to total number of prefetches issued,
which measures the algorithm's accuracy, and the percentage of cache misses correctly
prefetched, which measures the algorithm's effectiveness in reducing cache misses.
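As a rough illustration of the two metrics, the following Python sketch computes accuracy and effectiveness from a miss trace and a list of issued prefetches. The function name, the input layout, and the matching rule (a prefetch counts as correct if the same address appears as a later miss that has not already been credited) are assumptions made for this example; the evaluation tools may match prefetches differently.

```python
def accuracy_and_effectiveness(miss_trace, prefetches):
    """Return (accuracy %, effectiveness %).

    miss_trace: miss addresses in program order.
    prefetches: (issue_index, address) pairs, issued right after miss_trace[issue_index].
    """
    covered = set()  # indices of misses that a prefetch would have removed
    for issue_index, address in prefetches:
        for j in range(issue_index + 1, len(miss_trace)):
            if miss_trace[j] == address and j not in covered:
                covered.add(j)
                break
    correct = len(covered)
    accuracy = 100.0 * correct / len(prefetches) if prefetches else 0.0
    effectiveness = 100.0 * correct / len(miss_trace) if miss_trace else 0.0
    return accuracy, effectiveness


# A five-miss stride stream; the last prefetch (160) is never referenced.
trace = [0, 32, 64, 96, 128]
issued = [(2, 96), (3, 128), (4, 160)]
acc, eff = accuracy_and_effectiveness(trace, issued)
```

In this toy trace two of the three prefetches are used, and two of the five misses are covered, so the two metrics can differ considerably for the same run.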
The algorithm's design has four general operating parameters, once the internal
parameters for each Kalman filter have been designed. Assuming the Kalman filter is
at this point a black box, the algorithm needs its NKF (number of Kalman filters),
VSTT (virtual stream thresholding tolerance), RT (recycling threshold), and RDM
(recycling delta multiplier) specified. Since the NKF is by far the parameter which
impacts the design, complexity, and ease of implementation the most, the evaluation
will vary this parameter to determine its effect, and the other parameters will be
fixed at reasonable values which show good performance.
The chosen operating parameters for the algorithm are shown in table 5.2.
Algorithm Parameter        Value
Kalman filter parameters   A = [1 1; 0 1]
                           H = [1 0]
                           Q = [1 0; 0 1]
                           R = [.001]
NKF                        4, 8, 16, 32 [Parameter being varied]
VSTT                       256 bytes
RT                         1,000,000 cycles
RDM                        5
Table 5.2: Algorithm evaluation parameters
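To make the Kalman filter parameters concrete, the sketch below runs the standard predict and update equations for a two-state (address, stride) model with A = [1 1; 0 1], H = [1 0], Q equal to the identity, and R = 0.001. It is a minimal illustration, not the thesis implementation: the class name, the initial state and covariance, and the rounding of the prediction to an integer address are all assumptions.

```python
class StrideKalman:
    """Hypothetical two-state Kalman filter: x = [address, stride]."""
    R = 0.001  # measurement noise variance

    def __init__(self, first_address):
        self.x = [float(first_address), 0.0]   # assumed initial state
        self.P = [[1.0, 0.0], [0.0, 1.0]]      # assumed initial covariance

    def step(self, z):
        """Consume measured address z and return the predicted next address."""
        (p00, p01), (p10, p11) = self.P
        x0, x1 = self.x
        # Predict: x' = A x, P' = A P A^T + Q, with Q = I.
        xp0, xp1 = x0 + x1, x1
        q00 = p00 + p01 + p10 + p11 + 1.0
        q01 = p01 + p11
        q10 = p10 + p11
        q11 = p11 + 1.0
        # Update: K = P' H^T / (H P' H^T + R), with H = [1 0].
        s = q00 + self.R
        k0, k1 = q00 / s, q10 / s
        y = z - xp0                            # innovation
        self.x = [xp0 + k0 * y, xp1 + k1 * y]
        self.P = [[(1 - k0) * q00, (1 - k0) * q01],
                  [q10 - k1 * q00, q11 - k1 * q01]]
        # One-step-ahead prediction, rounded to an integer address (assumption).
        return round(self.x[0] + self.x[1])


kf = StrideKalman(1000)
predictions = [kf.step(1000 + 32 * i) for i in range(1, 21)]
```

With R much smaller than Q, the gain on the address state stays close to one, so the filter follows the measured addresses closely and the stride estimate settles within a few samples on a constant-stride stream.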
All of the following results were generated from the C++ algorithm tool. The Matlab
evaluation tool was used primarily to determine reasonable values to use for the above
parameters.
First, the Alpha architecture was evaluated using the algorithm evaluation tool. All 40
of the benchmark traces were run through the tool to determine the accuracy and
effectiveness in reducing cache misses. Figure 5.3 shows the overall accuracy of the
algorithm (in terms of the percentage of the total number of prefetches issued that
were correct) and figure 5.4 shows the overall effectiveness of the algorithm (in terms
of the percentage of total number of miss references in the program that were
correctly prefetched).
[Chart; x-axis: Program; series: 4, 8, 16, and 32 filters]
Figure 5.3: Accuracy of the algorithm for the Alpha ISA
[Chart; x-axis: Program; series: 4, 8, 16, and 32 filters]
Figure 5.4: Effectiveness of the algorithm for the Alpha ISA
One observation that can be made from this data is that, on the whole, the algorithm is
fairly accurate in the prefetches it predicts. Most of the prefetches that were issued
were correct in the sense that the processor accessed that memory location at a later
time. Of course, due to the nature of the algorithm it would be impossible to achieve
100% accuracy, because the prefetcher will continue to issue prefetches until it misses
a prediction. Even if all of the virtual streams in an execution were completely
predictable, there will still be a prefetch issued that is not counted as correct resulting
from the final miss reference from each virtual stream. By not counting this final miss
reference against the prefetcher's accuracy rating, it would be theoretically possible to
achieve 100% accuracy for a very well behaved memory access pattern. The accuracy
values for the four different NKF values which are shown in figure 5.3 are
summarized in table 5.5.
NKF   Average % of correct prefetches   Standard deviation of correct
      out of prefetches issued          prefetches out of prefetches issued
4     79.2%                             25.5%
8     78.4%                             25.9%
16    76.9%                             26.4%
32    75.7%                             26.9%
Table 5.5: Accuracy of algorithm for Alpha ISA
As the results show, the overall accuracy of the algorithm decreases as more filters are
used. This can be explained by the fact that when there are more filters available, there
are more opportunities to make prefetches on streams that might not be very well
behaved with respect to the internal Kalman filter parameters. This can increase the
number of prefetches issued that turn out to be incorrect. When there are fewer filters
available, there is a higher probability that all of the filters are locked onto valid virtual
streams and are issuing good prefetches, and the memory references that are members
of badly behaved virtual streams are being discarded. However, when the algorithm is in
a state in which all of its filters are busy, there is also the risk of missing virtual streams
that could be predicted very well. The main negative impact of the reduction in
accuracy is the additional burden it places on the memory bus to fetch memory
values that will never be used, and the strain it places on the cache, which must store
blocks that will never be used and may displace useful cache blocks. While the
evaluation process does not take into account any of these effects, the simulation step
of the testing process does, which is discussed later in chapter 6.
The second performance metric is the effectiveness of the algorithm, or the number of
memory references that were correctly prefetched. This metric demonstrates the
algorithm's ability to reduce cache misses, or at least hide some of the latency of a
cache miss. The effectiveness is shown in figure 5.4 and summarized below in table
5.6.
NKF   Average % of memory references    Standard deviation of memory references
      that were correctly prefetched    that were correctly prefetched
4     19.4%                             21.3%
8     23.0%                             23.0%
16    26.7%                             24.9%
32    29.4%                             26.4%
Table 5.6: Effectiveness of algorithm for Alpha ISA
When the algorithm is evaluated for effectiveness in reducing cache misses, it is found
that the effectiveness increases with NKF on average. This is because many programs
are able to take advantage of the additional filters by picking up additional virtual
streams. While there may be an overall decrease in algorithm accuracy with an increase
in NKF, the overall increase in cache miss reduction outweighs the accuracy reduction.
There is a wide range of effectiveness results for the 40 benchmark programs. The
algorithm performed very well on some programs (equake, g721, epic) and performed
very poorly on others (bh, health, vpr). This is indicative of the access patterns that
exist in these programs. Since the algorithm is using filters designed to pick up on
stride patterns, it is natural to assume that programs that access memory in stride
patterns will have a better performance when using this algorithm. In fact, if an
average is taken over the top 10 performing programs, the average effectiveness is
62.3%.
It is also apparent from the results which of the programs are bound by the NKF. For
programs in which the effectiveness jumps sharply between each NKF value (epic,
gcc, jpeg, swim), it indicates that these programs have many virtual streams which can
be predicted, and will benefit from having more internal filters. For some other
programs, the effectiveness changes very little between variations in the NKF (adpcm,
equake, treeadd). This is indicative of a program in which there are only a few main
virtual streams, which can be easily predicted using fewer internal
filters.
Based on these results, the simulation should show that programs for which the
algorithm has a high effectiveness have a higher chance of a
reduction in execution time (there will only be a significant reduction in execution time
if the process is memory bound). For programs that showed poor algorithm
effectiveness, the simulation should correspondingly show no significant reduction in
execution time. Finally, since these evaluations were performed using the highest
compiler optimization aggressiveness available and the Alpha ISA has a PREFETCH
instruction designed to hide memory latency already, the execution time reduction may
be smaller still.
To contrast the Alpha ISA with an ISA that does not perform software based
prefetching, the PISA ISA was also evaluated for the same benchmark programs. This
ISA does not contain a PREFETCH instruction, so it may show different simulation
results. Figures 5.7 and 5.8 show the accuracy and effectiveness of the algorithm for
PISA, respectively.
[Chart; x-axis: Program]
Figure 5.7: Accuracy of algorithm for PISA
[Chart; x-axis: Program; series: 4, 8, 16, and 32 filters]
Figure 5.8: Effectiveness of algorithm for PISA
The results for the PISA evaluations are similar to the results for the Alpha evaluations
in that many of the programs show good accuracy and effectiveness. Many
of the programs performed similarly between the two architectures; only a couple of
programs performed well under one of the architectures but poorly under the other.
For example, nearly 70% of the miss references in wupwise were correctly prefetched under the
PISA architecture with an NKF of 32, but the analogous Alpha test only successfully
prefetched around 6%. Since the architectures are different, this was expected in some
of the traces. What is important is the overall accuracy and effectiveness of the
algorithm, and how those values translate into performance enhancement during the
simulation phase as will be demonstrated in chapter 6.
The overall accuracy of the algorithm evaluated using the PISA architecture is given in
table 5.9.
NKF   Average % of correct prefetches   Standard deviation of correct
      out of prefetches issued          prefetches out of prefetches issued
4     83.9%                             23.7%
8     82.8%                             24.6%
16    81.9%                             25.1%
32    81.0%                             25.5%
Table 5.9: Accuracy of algorithm for PISA
The accuracy of the algorithm using the PISA architecture follows the same patterns as
identified in the Alpha architecture. As the NKF increases, the accuracy decreases
overall. This is again due to the presence of more filters, leading to more prefetching
on virtual streams that may not be well behaved, pulling the overall accuracy down.
The PISA evaluations showed a generally higher accuracy, about 5 percentage points higher for each
NKF value. This could be due to a number of factors, including inherent differences in
the architectures or the presence vs. the absence of the PREFETCH instruction. The
presence of the PREFETCH instruction in the Alpha ISA may be detrimental to the
accuracy of the algorithm, because the algorithm pays no attention to the nature of the
instruction stream. If there is a virtual stream whose cache misses are being generated
by PREFETCH instructions, then the filter is following a miss reference stream that has already
been altered by prefetching inserted by the compiler. If the compiler
inserts PREFETCH instructions that predict incorrectly, then it could disrupt the
operations of this hardware based prefetcher due to a miss reference stream that is not
accurately portraying the access patterns of the execution. The simulation numbers are
better able to support this concept because they measure the actual execution time
rather than just the proportion of cache misses prevented. Table 5.10 summarizes the
effectiveness of the algorithm for PISA.
NKF   Average % of memory references    Standard deviation of memory references
      that were correctly prefetched    that were correctly prefetched
4     27.1%                             28.6%
8     32.5%                             30.1%
16    38.2%                             32.3%
32    40.9%                             32.8%
Table 5.10: Effectiveness of algorithm for PISA
An initial observation from the PISA numbers is that the algorithm is much more
effective in reducing cache misses for the PISA architecture than the Alpha
architecture. On average, it is about 10 percentage points more effective for the PISA programs. A
subset of the top 10 performing PISA programs using an NKF value of 32 has an
average reduction in cache misses of 82%.
An observation about the nature of the programs that seem to perform well is that
many of the programs that have a high percentage of total memory accesses correctly
prefetched come from the MediaBench [33] set of benchmark programs. Many of
these programs deal with compression and decompression of large data sets. Since
many algorithms that perform these sorts of operations involve accessing long
sequences of linear data, this prefetcher is well suited for aiding these types of
programs in their execution. This is of course expected due to the design of the
internal Kalman filters.
The above evaluations only vary the NKF parameter to a value as high as 32. If the
NKF parameter is increased to larger values, the accuracy continues to decrease but
the effectiveness increases at a slower rate, approaching a plateau. The algorithm was
evaluated again with larger NKF values to measure the effects of increasing the NKF.
NKF   Average % of memory references    Standard deviation of memory references
      that were correctly prefetched    that were correctly prefetched
64    31.2%                             27.1%
128   32.3%                             27.2%
256   32.9%                             27.2%
Table 5.11: Effectiveness of algorithm for Alpha with large NKF values
For the Alpha architecture summarized in table 5.11, doubling the NKF value
has a diminishing marginal return, and the small increase in performance
is coupled with a large increase in implementation complexity. An NKF of 32 is
chosen as the maximum value for the simulations in chapter 6, because the most
benefit is seen from the first 32 filters, and another 32 filters only shows a very small
marginal benefit.
NKF   Average % of memory references    Standard deviation of memory references
      that were correctly prefetched    that were correctly prefetched
64    42.7%                             32.5%
128   44.7%                             32.6%
256   46.1%                             33.1%
Table 5.12: Effectiveness of algorithm for PISA with large NKF values
The PISA architecture exhibits similar behavior to the Alpha (see table 5.12), except the marginal
return is still slightly larger than it was for the comparable NKF value on the Alpha
architecture. If the implementation details were not a design issue, then adding larger
numbers of filters to the array would be a feasible idea, but there is a limit to how
much performance can be attained. The maximum NKF value for the PISA
architecture is also chosen to be 32, for the same reason as the Alpha NKF
parameter.
Figure 5.13: Effectiveness of algorithm vs. NKF
Figure 5.13 shows the effectiveness of the algorithm as the NKF value increases. Each
point on the x-axis represents twice as many filters as the previous point, so each point
represents a large increase in implementation costs. The figure shows that at an NKF
value of 32, there is a knee in the curve, where the marginal benefit from doubling the
NKF decreases. This supports the choice for selecting 32 as the maximum value for
NKF for the simulations in chapter 6.
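The diminishing return can be checked directly from the averages in tables 5.6, 5.10, 5.11, and 5.12. The short Python sketch below computes the gain in average effectiveness, in percentage points, for each doubling of the NKF; the variable and function names are chosen for this example only.

```python
# Average effectiveness (%) copied from tables 5.6, 5.10, 5.11 and 5.12.
NKF_VALUES = [4, 8, 16, 32, 64, 128, 256]
EFFECTIVENESS = {
    "Alpha": [19.4, 23.0, 26.7, 29.4, 31.2, 32.3, 32.9],
    "PISA": [27.1, 32.5, 38.2, 40.9, 42.7, 44.7, 46.1],
}


def marginal_gains(values):
    """Gain in percentage points for each doubling of the NKF."""
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


for isa, values in EFFECTIVENESS.items():
    for nkf, gain in zip(NKF_VALUES[1:], marginal_gains(values)):
        print(f"{isa}: {nkf // 2:>3} -> {nkf:>3} filters: +{gain:.1f} points")
```

For the Alpha averages, the gain per doubling is 3.6 and 3.7 points up to 16 filters, 2.7 points from 16 to 32, and under 2 points for every doubling beyond 32.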
Overall these evaluation results are promising, and show that the algorithm can
successfully achieve a good degree of accuracy and effectiveness in preventing cache
misses. A shortcoming of many proposed prefetch algorithms is that they only
measure their accuracy and effectiveness in terms of cache misses prevented. This is a
useful metric but the impact on execution time is the bottom line. If the correct
prediction of cache misses cannot benefit execution time, then the algorithm has not
been shown to be useful. In the next chapter, the algorithm is simulated in an
execution driven mode by using the SimpleScalar [15] CPU simulator. This allows a
measurement of the execution time for these same benchmarks, because the simulator
executes the program code with the prefetch algorithm working at runtime to prefetch
memory addresses.
Chapter 6
Simulation
This section presents the results from the SimpleScalar [15] simulations for the 40
benchmark programs. The goal of the simulations is to get a measure of execution
time reduction. In the previous chapter, the metrics presented dealt only with the
elimination of cache misses and the accuracy of the algorithm in predicting the correct
values to prefetch. This does not guarantee that the execution time will be reduced. If
the program execution does not spend a significant amount of time stalled and waiting
for memory accesses, then there may be little or no benefit of this prefetch algorithm,
even if the algorithm can effectively prefetch every cache miss before it is needed.
Evaluation metrics
To get a measure of how effective the algorithm is in reducing execution time for a
benchmark program, the result of each algorithm simulation is compared to a
simulation of the same program in which no prefetching has been applied. However,
there are many programs in which no matter how effective the algorithm is in
prefetching data blocks, the speedup will be negligible. To determine which
benchmarks have a significant potential for speedup, the same benchmark programs
are also executed on a modified SimpleScalar platform in which all data level one
accesses are configured to be cache hits. This simulates a program execution running
with an ideal prefetcher, which always retrieves the correct data into cache on time,
and never replaces cache blocks that the processor is using. The execution time for this
configuration represents an upper bound on the effectiveness that any prefetch
algorithm could have on the program. The set of ideal execution times was
generated, but not published, by other researchers and was provided for use in
this work.
To measure the reduction in execution time, the metric used is the percent speedup for
the execution. Speedup is the percentage reduction in the execution time, and is
defined in this work as:
\[
\text{Speedup} = \frac{\text{OriginalExecutionTime} - \text{NewExecutionTime}}{\text{OriginalExecutionTime}} \times 100\% \tag{6.1}
\]
From the formula, it is clear that as the new execution time approaches the original
execution time, speedup will approach zero, and when the new execution time
approaches zero, speedup approaches 100%. For all of the execution times discussed
in this work, the unit of measurement is CPU cycles where the frequency of operation
for all processors is the same.
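Equation 6.1 translates directly into code. The sketch below is a small Python illustration; the function names are hypothetical, and the second helper, which expresses a measured speedup as a share of the ideal-prefetcher speedup for a single program, is an assumption about how such a ratio could be formed rather than a description of how the averages in this work were computed.

```python
def speedup_percent(original_cycles, new_cycles):
    """Equation 6.1: percentage reduction in execution time (CPU cycles)."""
    return (original_cycles - new_cycles) / original_cycles * 100.0


def percent_of_ideal(original_cycles, new_cycles, ideal_cycles):
    """Measured speedup as a percentage of the ideal-prefetcher speedup."""
    return (speedup_percent(original_cycles, new_cycles)
            / speedup_percent(original_cycles, ideal_cycles) * 100.0)


# Illustrative cycle counts only, not measured data.
baseline, with_prefetcher, ideal = 1000, 900, 600
raw = speedup_percent(baseline, with_prefetcher)
share = percent_of_ideal(baseline, with_prefetcher, ideal)
```

With these made-up counts the prefetcher gives a 10% speedup against an ideal of 40%, so it captures a quarter of the available improvement.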
Integration with SimpleScalar
To get a measure of the execution times for the benchmark programs using the
algorithm proposed in this work, the SimpleScalar [15] tool set is used with
modifications to implement the algorithm. The SimpleScalar tool set comes with
several different types of simulation programs such as trace-driven simulators and
execution-driven simulators. This work uses the sim-outorder program which is
provided with the SimpleScalar tool set, because it simulates an aggressive superscalar
out-of-order issue microprocessor, and the program is actually executed for the
simulation in the same manner a real microprocessor would execute it. This is the most
appropriate simulator for this work because the main metric of interest is execution
time.
The algorithm that was implemented for the C++ tool for evaluations in chapter 5 is
designed in a way such that it is easily integrated into the SimpleScalar code. The
algorithm code is able to initialize itself given certain runtime parameters (such as
NKF, VSTT, etc.), and then operates by taking an input point from the calling
program and returning a prediction point along with a flag indicating whether or not
to issue a prefetch with this prediction.
The logic to integrate this algorithm was placed in the cache subsystem module of
SimpleScalar. In the function that responds to a cache access, if there is a cache miss
on the data level one cache, then the algorithm is invoked with the cache miss address.
Based on the return value of the algorithm, a prefetch may or may not be issued. If a
prefetch is issued, then the cache access routine is recursively invoked a second time
effective one cycle in the future (assumed latency to produce the prediction) with a
special flag indicating that this is from the algorithm's prefetch mechanism. The
recursive execution of this function mimics the addition of the prefetch into the
memory reference stream by requesting the prediction value from the data level one
cache. If the value is already in the cache, then nothing happens. If it is not in the
cache, then the normal cache logic of accessing the data from the next level of cache
or main memory is invoked, which implements the actual prefetch. In the event of a
cache hit, the point is also given to the algorithm, and the list of waiting filters is
examined (done sequentially in the code, but would be done in parallel in a hardware
implementation). If the point is within the VSTT of any one of the waiting filters then
that point is issued to that filter and the same logic that issues prefetches will be
invoked if necessary. Figure 6.1 shows the sequence of steps taken on a cache access.
[Flowchart. Recoverable steps: "Cache access at address X"; "Give point X to prefetch algorithm. Get prediction point Y."; "Recursively call this function with point Y, but do not re-invoke algorithm."; "Service cache miss"; two decision branches labelled "No".]
Figure 6.1: Flowchart of SimpleScalar logic to implement the algorithm
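The sequence in figure 6.1 can be sketched as a short Python model. This is not SimpleScalar code: the cache is a plain set of line numbers, `ToyPredictor` is a hypothetical single exact-stride predictor standing in for the Kalman filter array, and hits and misses are both offered to that one predictor instead of only to waiting filters.

```python
LINE_BYTES = 32  # cache line size from table 6.2


class ToyPredictor:
    """Hypothetical stand-in for the filter array: one exact-stride predictor."""

    def __init__(self):
        self.last_input = None
        self.last_prediction = None

    def step(self, address):
        # Prefetch only while the previous prediction was exactly right (the P bit).
        was_correct = address == self.last_prediction
        stride = 0 if self.last_input is None else address - self.last_input
        self.last_input = address
        self.last_prediction = address + stride
        return self.last_prediction, was_correct and stride != 0


def cache_access(cache, predictor, address, cycle, fills, from_prefetcher=False):
    """Return True on a hit. `cache` is a set of resident line numbers."""
    line = address // LINE_BYTES
    hit = line in cache
    if not hit:
        # Service the miss from the next level; for a prefetch this is the prefetch itself.
        cache.add(line)
        fills.append((cycle, address, "prefetch" if from_prefetcher else "demand"))
    if from_prefetcher:
        return hit  # do not re-invoke the algorithm on its own prefetch
    # Simplification: hits and misses are both offered to the single toy predictor.
    prediction, issue = predictor.step(address)
    if issue:
        # Recursive call, effective one cycle later, flagged as coming from the prefetcher.
        cache_access(cache, predictor, prediction, cycle + 1, fills, from_prefetcher=True)
    return hit


def run_trace(addresses):
    cache, predictor, fills = set(), ToyPredictor(), []
    hits = [cache_access(cache, predictor, a, cycle, fills)
            for cycle, a in enumerate(addresses)]
    return hits, fills


hits, fills = run_trace([0, 32, 64, 96, 128])
```

On this stride-32 trace the first three accesses miss while the predictor locks on, and the later accesses hit because their lines were filled by earlier prefetches.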
Simulated system
The architecture simulated is configured to be as close to the true Alpha 21264
architecture [36] as SimpleScalar allows. The configuration parameters are:
Parameter                       Value
Instruction fetch queue size    16
Branch mis-prediction latency   16 cycles
Branch prediction policy        2 level branch prediction mechanism, combination of local and global predictors
Branch predictor level 1        16384 entries, 16 history entries
Branch predictor level 2        16384 entries, 16 history entries
Branch predictor chooser        65536 entries
Branch target buffer            4096 entries, 2 way set associative
Decode unit width               8 bytes
Issue unit width                8 bytes
Commit unit width               11 instructions
Register update unit size       128
Load/store queue size           64
Level 1 data cache              1024 entries, 2 way set associativity, 32 byte cache line, LRU replacement policy, 3 cycle hit latency
Level 1 instruction cache       1024 entries, 2 way set associativity, 32 byte cache line, LRU replacement policy, 3 cycle hit latency
Level 2 cache                   32768 entries, directly mapped, 32 byte cache line, LRU replacement policy, 12 cycle hit latency
Memory latency                  80 cycles for first chunk, 2 cycles for each additional chunk
ITLB                            128 way associative with block size 8192, 30 cycle hit latency
DTLB                            128 way associative with block size 8192, 30 cycle hit latency
Integer ALUs                    4
Integer multipliers             1
Memory system ports             2
Floating point ALUs             2
Floating point multipliers      1
Table 6.2: Architecture configuration used for the simulations
Both the Alpha simulations and the PISA simulations use the same microprocessor
architecture setup parameters as shown in table 6.2, the only difference being the ISAs.
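As a worked check of the cache parameters in table 6.2, the following Python sketch computes the implied capacities and a memory access latency. It assumes that "entries" means the number of sets, which is an interpretation and is not stated in the table; the number of chunks per memory access depends on the bus width, which the table does not give, so it is left as a parameter.

```python
def cache_bytes(sets, ways, line_bytes):
    """Total capacity, assuming 'entries' in table 6.2 means sets."""
    return sets * ways * line_bytes


def memory_latency_cycles(chunks, first=80, each_additional=2):
    """Latency of one memory access that transfers `chunks` chunks."""
    return first + (chunks - 1) * each_additional


L1_DATA_BYTES = cache_bytes(sets=1024, ways=2, line_bytes=32)    # 64 KB
L2_BYTES = cache_bytes(sets=32768, ways=1, line_bytes=32)        # 1 MB
FOUR_CHUNK_LATENCY = memory_latency_cycles(chunks=4)             # hypothetical chunk count
```

Under that reading, the level one data cache holds 64 KB and the level two cache holds 1 MB.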
Results
The simulations were run on the benchmark programs for the Alpha 21264
architecture and the PISA architecture. The Alpha architecture has five levels of
compiler optimizations, numbered zero through four. The PISA architecture has four
levels of optimization, numbered zero through three. All of the optimization levels were
simulated in order to see the effects of the code optimization level on the algorithm
performance, and those results are discussed later in this chapter. All of the results
presented up until then assume the highest optimization level available for each
architecture: level four for Alpha and level three for PISA. This work assumes that
programs that are generally used in production environments are compiled with the
most aggressive optimizations available for performance reasons, so these programs
will be of the most importance in determining the effectiveness of the algorithm.
Since the benchmark programs simulated include a wide variety of programs, many of
which have a very low potential speedup even with an ideal prefetcher, the benchmark
programs are divided into two groups to examine the overall average speedup. The
first group contains the programs in which the
ideal prefetcher simulations resulted in a speedup greater than or equal to 10%, and the
second group contains the programs in which the ideal prefetcher simulations resulted
in a speedup of less than 10%. The programs for the Alpha ISA that had a potential
speedup of greater than 10% are shown in figure 6.3.
[Chart; x-axis: Program (applu, apsi, art, bzip2, em3d, gcc, health, mcf, mst, swim, treeadd, vpr, average)]
Figure 6.3: Speedup on Alpha benchmarks with potential speedup greater than 10%
When considering only programs which have a potential speedup of greater than 10%,
there are still some programs which show little to no improvement. This is indicative
of the inherent memory access patterns of the programs themselves. However, some
programs do show a significant speedup (art, swim, treeadd), which indicates that the
Kalman filter parameters used in this design of the algorithm are appropriate for the
memory access patterns for those programs.
Simulation         Average Speedup   Average Speedup relative to ideal
NKF = 8            3.8%              11.6%
NKF = 16           6.2%              17.2%
NKF = 32           6.5%              17.7%
Ideal prefetcher   39.0%             N/A
Table 6.4: Average speedups for Alpha benchmarks with potential speedup greater than 10%
As shown in table 6.4, for NKF = 32 the algorithm achieved about 17.7% of the
potential speedup that could be attained if every memory reference was prefetched
perfectly. Overall this corresponds to a 6.5% raw speedup, where the ideal prefetcher
produced a speedup of 39.0% for the same benchmark programs. One important fact
to notice is that an increasing NKF parameter yields a significant marginal return
between NKF = 8 and NKF = 16, but a much smaller return between NKF = 16 and
NKF = 32. This trend shows that there is a diminishing marginal return for each filter
added to the algorithm, and the algorithm can achieve a high percentage of its
effectiveness with a small number of filters.
While the algorithm does produce a speedup for those programs that have a high
potential for speedup, it is also important to examine what the algorithm does for
programs that do not have a high potential for speedup. The remainder of the
programs simulated are shown in figure 6.5.
[Chart; y-axis ticks from 0.00% to 3.00%; x-axis: Program; series: NKF = 8, NKF = 16, NKF = 32]
Figure 6.5: Speedup on Alpha benchmarks with potential speedup less than 10%
The benchmark programs with a potential speedup less than 10% show a much
smaller average speedup, less than half a percent. It is important to note that the
algorithm did not have a significant negative impact on these programs that show little
potential. The largest negative speedup that is observed is -0.01% for jpeg_compress.
The negative speedup is the result of inaccurate prefetching and cache pollution, but is
shown to not have a significant negative impact on the execution of the program.
The Alpha simulations have shown that the algorithm can produce a speedup for
several programs, but the Alpha ISA and the PISA ISA are different, including the
additional PREFETCH instructions in the Alpha ISA. To see the differences between
the Alpha and the PISA performances, the same simulations were performed for the
PISA architecture. The speedups for the PISA benchmark simulations that had a
potential speedup greater than 10% are shown in figure 6.6.
[Bar chart: speedup (%) per program for NKF = 8, NKF = 16, and NKF = 32. Programs on the x-axis include bzip2, em3d, gcc, health, mcf, mpeg2 encode, mst, parser, swim, treeadd, vpr, and the average.]
Figure 6.6: Speedup on PISA benchmarks with potential speedup greater than 10%
The PISA plot is similar to the Alpha plot in that some of the program
executions did benefit from the algorithm, but many still did not show a large
speedup. Again, many programs saw little to no benefit, and again this
was expected due to the nature of the prefetcher algorithm and the fact that not all
programs are suited for this particular algorithm. As in the Alpha simulations, art
and swim showed a large improvement, but treeadd showed a much smaller speedup
relative to the Alpha simulations. Table 6.7 summarizes the speedup results for the
PISA simulations.
Simulation          Average Speedup    Average Speedup relative to ideal
NKF = 8             4.9%               14.8%
NKF = 16            5.2%               16.3%
NKF = 32            5.6%               17.8%
Ideal prefetcher    38.3%              N/A
Table 6.7: Average speedups for PISA benchmarks with potential speedup greater than 10%
On average, the PISA architecture experiences a smaller speedup from the algorithm
than the Alpha architecture did. In the evaluation section in chapter 5, a different
phenomenon was observed. The algorithm had greater success in reducing the
number of cache misses for the PISA architecture than for the Alpha on average, but this
did not lead to a greater average speedup for those programs which have a high
potential speedup. This shows that the effectiveness of the algorithm in
reducing cache misses and the speedup a program will receive from the algorithm are
not directly correlated.
Figure 6.8 shows the actual speedup for the PISA programs that did not have a
significant speedup potential:
[Bar chart: speedup (%) per program for NKF = 8, NKF = 16, and NKF = 32]
Figure 6.8: Speedup on PISA benchmarks with potential speedup less than 10%
Much like the results for the Alpha simulations, many of these programs show a small
speedup if any, but more importantly the algorithm did not negatively affect the
program executions in any significant way.
Based on the results, a recommendation for the NKF value in an implementation of
this prefetcher algorithm is 32. The Alpha simulations saw a large improvement as
NKF increased from 8 to 16, and slightly more at 32; however, the PISA architecture
did not see as significant an improvement between 8, 16, and 32. The law of
diminishing marginal returns is in effect for the NKF parameter, and as NKF
continues to rise, there is a large increase in implementation cost for a very small
corresponding performance gain. As recommended in chapter 5, an NKF value of 32
is still the best choice for this parameter.
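The diminishing return can be made concrete with a small calculation. The following is a minimal sketch that uses only the average PISA speedups reported in table 6.7; the function name and the choice of "percentage points per added filter" as the measure are illustrative assumptions, not part of the algorithm itself.

```python
# Average PISA speedups (percent) from Table 6.7, keyed by NKF.
pisa_speedup = {8: 4.9, 16: 5.2, 32: 5.6}


def marginal_gain_per_filter(speedups):
    """Return (nkf_low, nkf_high, gain per added filter) for consecutive NKF settings."""
    points = sorted(speedups.items())
    gains = []
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        # Speedup gained divided by the number of filters added to get it.
        gains.append((n0, n1, (s1 - s0) / (n1 - n0)))
    return gains


for n0, n1, gain in marginal_gain_per_filter(pisa_speedup):
    print(f"NKF {n0} -> {n1}: {gain:.4f} percentage points per added filter")
```

Under these numbers, each filter added between 16 and 32 contributes less than each filter added between 8 and 16, which is the trend the recommendation weighs against implementation cost.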
Compiler Optimization Levels
The Alpha and PISA architecture compilers have several levels of optimization choices
for the compiled executable code. All of the results discussed so far have dealt with the
highest optimization level for each. This section discusses the effect the algorithm has
on executables compiled with different levels of code optimization. Figure 6.9 shows
how the compiler optimization level affects the average speedup for all 40 benchmark
programs.
[Line chart: average speedup versus optimization level for NKF = 8, NKF = 16, NKF = 32, and the ideal prefetcher]
Figure 6.9: Average speedup for various optimization levels of Alpha
The average speedup decreases as the optimization level increases. A possible reason
for this is that there are fewer prefetch instructions embedded in the code at the lower
optimization levels, so there is more opportunity for the prefetch algorithm to provide
a speedup through prefetching. Another possibility is that the code in its
unoptimized state follows more structured and predictable patterns, because the control
logic has not been obfuscated as much through the optimization process.
From the graph it can be seen that at the lower optimization levels, there is in fact a
significant performance gain as NKF is increased from 8 to 16, but there is a much
smaller performance gain as NKF is increased from 16 to 32.
[Line chart: average speedup versus optimization level for NKF = 8, NKF = 16, NKF = 32, and the ideal prefetcher]
Figure 6.10: Speedup for various optimization levels of PISA
Figure 6.10 shows the average speedup vs. optimization level for the PISA architecture
for all benchmark programs. The trend it follows is the reverse of that of the Alpha; as
the optimization level increases, the algorithm shows a greater speedup when applied.
One possible reason for this is the lack of PREFETCH
instructions in the PISA instruction set architecture. Since optimization does not
introduce compiler-based prefetches for this architecture, there is still potential
for the algorithm to produce a speedup with its prefetching. The next chapter will
compare the results from the evaluation done in chapter 5 and the simulation results
from this chapter.
Chapter 7
Discussion
The objective of this work is to present an algorithm which uses digital signal
processing techniques to perform prefetching in a microprocessor, and then make an
evaluation of the feasibility of using digital signal processing for prefetching. This
chapter discusses the results from the evaluation and simulation phases, and
determines if this algorithm is feasible.
The main point to consider in determining if this algorithm is feasible is whether or
not it can produce a speedup in the execution of a program that outweighs the cost of
implementing the algorithm. While this work does not attempt to suggest a
particular implementation of the algorithm, it can be characterized in terms of
complexity based on the NKF. Basically, the more filters allocated for the algorithm,
the more complex it will be to implement. However, as described in [30], the number
of available transistors per unit area is continually increasing with time, so the hardware
complexity of the algorithm is not as important an issue as it was in previous years.
In chapter 6, the best value for the NKF parameter is determined to be 32, based on
the marginal gain in speedup per filter, so this value is used to compare the evaluation
and simulation results in this chapter. Note that chapters 5 and 6 show that
increasing the number of filters available does result in increased performance,
so, on an implementation-by-implementation basis, the choice to increase this
parameter for additional performance is available.
The evaluation of the Alpha benchmarks showed that on average the algorithm
correctly prefetched 29.4% of the miss references. Overall, the ideal prefetcher
(correctly prefetching 100% of the miss references) achieved a speedup of 39.0% for
those programs that have a potential speedup of greater than 10%. If the coverage of
the algorithm were directly correlated to the speedup, then the algorithm should be
able to produce an overall speedup of 11.5%, but the algorithm only showed an
average speedup of 6.5% for these benchmark programs. This is mainly because the
effectiveness and the speedup cannot be directly correlated, because there are many
other factors affecting the actual speedup that an execution sees. One issue that has
not been taken into account is the timeliness of the prefetcher (discussed later in this
chapter), and another issue is the nature of the execution itself. A cache miss does not
necessarily completely stall the processor; since this is a superscalar, multi-issue, out-of-order
processor, there may be other instructions that can execute in the time that the
processor is waiting for a cache line to arrive. Even if the cache line were immediately
available, the processor would not always experience an instantaneous speedup equal
to the memory latency. Due to this, each cache miss has its own unique benefit that
could be seen if this cache miss were prevented. Since all cache misses are not equal,
the effectiveness cannot be directly correlated to the speedup. If every cache miss in
the processor did contribute a number of stall cycles exactly equal to memory latency,
then this comparison would be valid.
The PISA simulations show a similar trend. With an NKF value of 32, the PISA
evaluation resulted in effectiveness coverage of 40.9%, and the ideal prefetcher caused
a 38.3% speedup. If the effectiveness were correlated to the speedup, then the PISA
simulation would see a speedup of 15.7%, but the PISA architecture in simulation sees
a speedup of 5.6%. The trend of the actual simulation speedup being lower than the
theoretical speedup based on the effectiveness could be caused by the nature of the
execution as described above, and also due to the fact that this algorithm is not
perfectly timely. This algorithm does not take special measures in an attempt to fully
hide the memory latency of a demand fetch operation by prefetching; it simply
attempts to hide as much as possible.
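The comparison made above for both architectures amounts to scaling the ideal speedup by the coverage. The following is a minimal sketch of that arithmetic using only the figures quoted in this chapter; the function and variable names are illustrative, and the linear scaling is exactly the assumption that the simulation results show does not hold.

```python
def coverage_scaled_speedup(coverage_pct, ideal_speedup_pct):
    """Speedup expected if speedup scaled linearly with coverage.

    This linear relationship is the hypothetical being tested, not a
    property of the prefetcher.
    """
    return coverage_pct / 100.0 * ideal_speedup_pct


# (coverage %, ideal speedup %, simulated speedup %) at NKF = 32, from the text.
results = {
    "Alpha": (29.4, 39.0, 6.5),
    "PISA": (40.9, 38.3, 5.6),
}

for isa, (coverage, ideal, observed) in results.items():
    expected = coverage_scaled_speedup(coverage, ideal)
    print(f"{isa}: coverage-scaled {expected:.1f}%, simulated {observed:.1f}%, "
          f"shortfall {expected - observed:.1f} points")
```

The gap between the coverage-scaled figure and the simulated figure is larger for PISA than for Alpha, even though PISA has the higher coverage.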
One observation that can be made from comparing the evaluation and simulation data
is that if the algorithm has very poor effectiveness for a particular
execution, it also does not have a significant impact on the execution time
of that program. For both the Alpha and PISA ISAs, see bh, health, and vpr. Each of
these programs showed very poor effectiveness due to the nature of its cache
access patterns, and each also showed a negligible speedup.
Timeliness
Timeliness is an important factor to consider in hardware based prefetching [14].
Timeliness is a measure of a prefetching algorithm's ability to fetch the data into cache
early enough such that the memory latency is completely hidden, but not so early that
it is at risk of being removed from the cache before it has been used. The algorithm
described in this work does not explicitly address the issue of timeliness in its design; it
will initiate a prefetch when it determines it is appropriate, and the data may arrive at a
time which is too early or too late to be considered optimally timely. However, even if
a prefetch is initiated some number of cycles before the processor demands it but it is
not available yet, the prefetcher has still hidden some of the memory latency.
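The point that a late prefetch still hides part of the latency can be illustrated with a simple cycle count. This is a minimal sketch under stated assumptions: a fixed memory latency, no contention, and no eviction of data that arrives too early. The function names are hypothetical and do not come from the simulator.

```python
def hidden_latency(issue_cycle, demand_cycle, memory_latency):
    """Cycles of memory latency hidden by a prefetch issued at issue_cycle
    for data that the processor demands at demand_cycle."""
    lead = demand_cycle - issue_cycle          # head start the prefetch had
    return max(0, min(memory_latency, lead))   # cannot hide more than the full latency


def stall_cycles(issue_cycle, demand_cycle, memory_latency):
    """Cycles the demand access still waits despite the prefetch's head start."""
    return memory_latency - hidden_latency(issue_cycle, demand_cycle, memory_latency)


# A prefetch issued 60 cycles before the demand, with a 100-cycle latency,
# hides 60 cycles and leaves 40 cycles exposed.
print(hidden_latency(100, 160, 100), stall_cycles(100, 160, 100))
```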
There are a couple of different simple modifications to this design to address the issue
of timeliness. Concerning prefetches that are issued too early, the research conducted
in [19] shows that even if a cache block is loaded into the level one cache too soon and
is replaced by a cache line demanded from the processor, it is still very likely that the
block remains in the level two cache. If the cache line is still in the level two cache,
then much of the memory latency has still been hidden compared to the scenario in
which no prefetching was performed.
Concerning prefetches that are issued too late, two modifications to the algorithm
could be made to address this. First, the algorithm uses cycle time data for the
purposes of recycling filters which are not performing well, but does not use cycle time
data for determining the correct time to prefetch. A simple addition would be to
monitor the cycle time delta between input points to a particular filter, and incorporate
this data into the state matrices of the Kalman filter. The filter could then offer two
predictions, one being the next memory address referenced in this stream, and the
second being the CPU cycle that it will be referenced in. With this information, the
prefetcher logic could be enhanced to issue the prefetch earlier than the predicted CPU
cycle by a number of cycles equal to the memory latency. Another modification is to
predict more than one state ahead. The Kalman filters used in this algorithm are only
used to predict a single step ahead. It is also possible to predict n states ahead, by
applying the time update equations multiple times at each time step to achieve multiple
predictions. Since each additional prediction results in additional error due to the
Kalman filter not incorporating measurement data into these calculations, the
prefetcher should only issue prefetches on these extra predictions if there is a high
level of confidence that the predictions are correct. The current prefetcher will issue a
single prefetch only when the last prediction is known to be correct. The prefetcher
should extend this model in the multiple prefetch case and only issue a second order
prefetch if the last two predictions were correct, and so on. This would address the
situation in which the cache miss references are arriving at cycle time intervals less than
the memory latency, in which the current design would never be able to completely
hide the memory latency.
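The two modifications described above can be sketched as follows. This is an illustration only, not the implementation evaluated in this work: it assumes a two-element state (address, stride) with transition matrix F = [[1, 1], [0, 1]] for the constant-velocity model, a diagonal process noise Q = q·I with an arbitrary q, and hypothetical function names. Repeating the time update without a measurement shows how the error covariance grows with each extra step, which is why deeper prefetches are gated on a longer run of correct predictions.

```python
def time_update(state, cov, q=1.0):
    """One Kalman time update for a constant-velocity model.

    state = (address, stride); cov = 2x2 error covariance as nested tuples.
    Assumes F = [[1, 1], [0, 1]] and process noise Q = q * I.
    """
    addr, stride = state
    (p00, p01), (p10, p11) = cov
    new_state = (addr + stride, stride)
    # F P F^T for F = [[1, 1], [0, 1]], then add Q on the diagonal.
    n00 = p00 + p01 + p10 + p11 + q
    n01 = p01 + p11
    n10 = p10 + p11
    n11 = p11 + q
    return new_state, ((n00, n01), (n10, n11))


def predict_ahead(state, cov, steps, q=1.0):
    """Apply the time update `steps` times with no measurement in between."""
    predictions = []
    for _ in range(steps):
        state, cov = time_update(state, cov, q)
        predictions.append((state[0], cov[0][0]))  # predicted address, its variance
    return predictions


def prefetch_depth(consecutive_correct, max_depth):
    """A k-th order prefetch requires the last k predictions to have been correct."""
    return min(consecutive_correct, max_depth)


def issue_cycle(predicted_cycle, memory_latency):
    """Issue early enough that the data arrives at the predicted reference cycle."""
    return predicted_cycle - memory_latency


preds = predict_ahead((0x1000, 64), ((0.0, 0.0), (0.0, 0.0)), steps=3)
depth = prefetch_depth(consecutive_correct=2, max_depth=3)
for addr, variance in preds[:depth]:
    print(hex(addr), variance)
```

In this sketch only the first two of the three predictions would be prefetched, because only the last two predictions are assumed to have been correct.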
Feasibility and practicality
Based on the data acquired through evaluation and simulation, the algorithm presented
in this work is in fact feasible, but perhaps not practical with its current design. The use
of digital signal processing techniques to perform prefetching in a microprocessor has
been shown to have potential in this work. This algorithm uses basic digital
signal-processing blocks (in this work, Kalman filters) and, based on their
predictive qualities, issues prefetches for a net speedup in execution time. However, the
exact algorithm presented in this work needs to be refined to produce better results
before attempting an implementation of it. There are several attributes of this
algorithm that could be researched in order to improve its performance.
The basic issue dealt with by prefetcher designers is to attain a model of the system
that needs to be predicted. This algorithm provides a framework that these models can
be integrated with to quickly evaluate the merits of a particular model as it pertains to
prediction of a memory reference stream. The mathematical model used in this work is
a simple one based on the concept that many executions have strided memory access
patterns [10, 23, 35]. This model could be extended to more closely model other
identifiable trends in the access patterns for various programs. In addition, this
algorithm uses the same model for each of the internal Kalman filters. Murphy [9]
describes a method in which several linear models are used, and the Kalman filter
dynamically switches between them or takes some linear combination of them to best
fit the current data. Multiple mathematical models appropriate for different access
patterns could be used, and some algorithm could be used to determine which model
best fits the pattern at runtime.
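One simple way to choose among models at runtime is to track each model's recent prediction error and use the model with the lowest error. The sketch below illustrates that idea only; it is not the switching Kalman filter described in [9], and the two stand-in models, the class names, and the decay factor are all hypothetical.

```python
class RepeatModel:
    """Stand-in model: predicts that the last address repeats."""

    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, addr):
        self.last = addr


class StrideModel:
    """Stand-in model: predicts the last address plus the last observed stride."""

    def __init__(self):
        self.last = None
        self.stride = 0

    def predict(self):
        return None if self.last is None else self.last + self.stride

    def update(self, addr):
        if self.last is not None:
            self.stride = addr - self.last
        self.last = addr


class ModelSelector:
    """Feeds every model each point and tracks a decayed absolute prediction error."""

    def __init__(self, models, decay=0.5):
        self.models = models
        self.decay = decay
        self.errors = [0.0] * len(models)

    def observe(self, addr):
        for i, model in enumerate(self.models):
            guess = model.predict()
            if guess is not None:
                self.errors[i] = self.decay * self.errors[i] + abs(addr - guess)
            model.update(addr)

    def best(self):
        index = min(range(len(self.models)), key=lambda k: self.errors[k])
        return self.models[index]


selector = ModelSelector([RepeatModel(), StrideModel()])
for address in [0, 64, 128, 192]:
    selector.observe(address)
print(type(selector.best()).__name__, selector.best().predict())
```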
One main difference between the algorithm presented in this work and the previously
proposed prefetch algorithms in chapter 3 is that many of the previously proposed
algorithms make use of a secondary cache-like structure that holds prefetched values
and is examined in parallel with the level one data cache. This addresses the problem
of cache pollution, which is present in this algorithm as can be seen in the
simulation results for the Alpha architecture in chapter 6. This algorithm could easily
be enhanced to use a similar structure as well.
The predictive properties of the Kalman filter are inherently uncertain. The time
update equations can be applied to the Kalman filter's state estimation to generate a
prediction of the next state's values, but without the integration of the measurement
(which is only available with the next data point), the prediction values have inherent
error in them. This algorithm uses the Kalman filter to determine which memory
address is likely to be missed next, and then issues a prefetch for the cache block that it
resides in. This limits the prefetch algorithm to being correct only if it predicted the
next cache block missed exactly right. Since the prediction has inherent error, it could
be the case that the prediction was close, but not exactly right (for instance, the next
cache block missed was an adjacent cache block to the one prefetched). A possible
enhancement to this algorithm would be to integrate some concepts from the stream
buffer design [22] in which the algorithm would prefetch the predicted cache block,
and those adjacent to it. These results would then be stored in a stream-buffer like
structure that is associated with each internal filter, and they would be queried in
parallel with the data level one cache to check for a cache hit. This could increase the
coverage of the algorithm, but will impose an additional demand on the memory
hierarchy to supply the additional cache lines. These cache blocks should be fetched
with a lower priority than demand-fetch requests coming directly from the processor.
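The enhancement described above can be sketched as a helper that expands one predicted address into the predicted block and its neighbors, plus a small per-filter buffer that is checked alongside the cache. This is a minimal illustration: the 64-byte block size, the buffer capacity, and all names are assumptions, and request priority is not modeled.

```python
from collections import deque


def blocks_to_prefetch(predicted_addr, block_size=64, neighbors=1):
    """Block-aligned addresses of the predicted block and its neighbors on each side."""
    block = predicted_addr // block_size
    first = max(0, block - neighbors)  # do not run below address zero
    return [b * block_size for b in range(first, block + neighbors + 1)]


class FilterBuffer:
    """Small FIFO of prefetched block addresses associated with one filter."""

    def __init__(self, capacity=4, block_size=64):
        self.block_size = block_size
        self.blocks = deque(maxlen=capacity)  # oldest entry is dropped when full

    def fill(self, block_addrs):
        for addr in block_addrs:
            if addr not in self.blocks:
                self.blocks.append(addr)

    def hit(self, addr):
        """True if the block containing addr is currently held in the buffer."""
        return (addr // self.block_size) * self.block_size in self.blocks


buffer = FilterBuffer()
buffer.fill(blocks_to_prefetch(0x1234))
print(buffer.hit(0x1250), buffer.hit(0x1300))
```

In this example the prediction 0x1234 was not exactly right for a reference to 0x1250, but the reference still hits because the adjacent block was fetched as well.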
Even though this algorithm may not be entirely practical for use in the data level one
cache, it may be for the instruction level one cache. The performance of the
instruction level-one cache can be improved dramatically by use of the stream buffer
[22], because of its sequential access nature. A performance evaluation of this
algorithm with respect to the instruction cache could be done to determine its
effectiveness, if any. The algorithm could also be easily adjusted to level two caches, or
even to the main memory itself, where it would prefetch virtual memory pages from
the mass storage subsystem directly.
Another potential area for improvement is the combination of this algorithm directly with
other algorithms. In [4], the Markov predictor is simulated alone, in parallel with
stream buffers and stride prefetchers, and in series with stream buffers and stride
prefetchers. Overall, the configuration that had the best coverage was the Markov
predictor in parallel with the stream buffer and stride prefetcher. The algorithm
proposed in this work could also be combined with other prefetchers to obtain a
hybrid design that can maximize the benefit of each of the prefetcher designs.
Finally, this work uses the Kalman filter as the basic digital signal-processing block
because of its flexibility and other factors noted in chapter 3. The algorithm presented
in this work does not need a Kalman filter as its basic digital signal-processing block
due to the way it treats the digital signal-processing element as a black box in the
design. Other logic blocks could be used as long as they implement the same interface,
which is being able to accept a data point and then produce a prediction based on the
data points it has seen so far. Even a hybrid of Kalman filters and other filters could be
possible, but additional logic would need to be added to the control and prefetch logic
blocks to determine which of the logic blocks is most appropriate for which stream.
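The black-box interface described above, accepting a data point and returning a prediction, could be expressed as follows. This is a hedged sketch: the class and method names are hypothetical, and the last-stride predictor is only a stand-in to show that any block implementing the interface can be driven by the same control logic.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class Predictor(ABC):
    """Black-box interface assumed by the control and prefetch logic."""

    @abstractmethod
    def observe(self, point: int) -> Optional[int]:
        """Consume one data point and return a prediction of the next one
        (None if no prediction can be made yet)."""


class LastStridePredictor(Predictor):
    """Stand-in block: predicts the current point plus the last observed stride."""

    def __init__(self):
        self._last = None

    def observe(self, point):
        prediction = None
        if self._last is not None:
            prediction = point + (point - self._last)
        self._last = point
        return prediction


def count_correct(predictor: Predictor, stream: Iterable[int]) -> int:
    """Drive any Predictor over a stream and count predictions that matched."""
    correct = 0
    pending = None
    for point in stream:
        if pending is not None and pending == point:
            correct += 1
        pending = predictor.observe(point)
    return correct


print(count_correct(LastStridePredictor(), [0, 8, 16, 24, 100]))
```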
The next chapter provides some concluding remarks and summarizes the results
presented in this work.
Chapter 8
Conclusions
This work proposes an algorithm that uses digital signal processing blocks to watch the
miss reference stream of a program execution, and perform prefetches based on the
predictions that the digital signal processing blocks generate. The digital signal
processing blocks are implemented as Kalman filters in this work, though it would be
possible to use any logic block in its place that implements the same interface to the
algorithm's black box design. Kalman filters are used for their flexibility and ability to
be effectively implemented in hardware, as well as their predictive qualities.
The algorithm works by watching the miss reference stream and the memory reference
stream of a microprocessor's execution of a program. The miss reference stream
provides a series of addresses that are interpreted as a digital signal. Simple
thresholding is used to differentiate points from the miss reference stream as
belonging to different virtual reference streams, and these virtual reference streams are
de-multiplexed from the miss reference stream into the input streams for each of the
Kalman filters. Each filter produces a prediction point for each input point it is given,
and if the filter's previous prediction is determined to be correct, then a prefetch will
be issued based on the filter's current prediction. The algorithm also has recycling
capabilities built in to free filters that are listening on invalid virtual streams, or virtual
streams that have ended.
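The flow summarized above can be sketched in a few lines. This is a simplified illustration, not the evaluated design: a last-stride predictor stands in for each Kalman filter, the thresholding rule (absolute address distance to a stream's last point) and its value are assumptions, correctness is checked on exact addresses rather than cache blocks, and filter recycling is omitted.

```python
class Stream:
    """One virtual reference stream; a last-stride predictor stands in for a Kalman filter."""

    def __init__(self, addr):
        self.last = addr
        self.prediction = None

    def feed(self, addr):
        """Consume a point; return True if the previous prediction matched it."""
        was_correct = self.prediction is not None and self.prediction == addr
        self.prediction = addr + (addr - self.last)
        self.last = addr
        return was_correct


class Prefetcher:
    def __init__(self, nkf=8, threshold=1024):
        self.nkf = nkf              # maximum number of filters
        self.threshold = threshold  # assumed distance rule for stream membership
        self.streams = []

    def on_miss(self, addr):
        """Route a miss address to a stream; return an address to prefetch, or None."""
        near = [s for s in self.streams if abs(addr - s.last) <= self.threshold]
        if near:
            stream = min(near, key=lambda s: abs(addr - s.last))
            if stream.feed(addr):
                return stream.prediction  # previous prediction was right: prefetch
            return None
        if len(self.streams) < self.nkf:
            self.streams.append(Stream(addr))  # start a new virtual stream
        return None


prefetcher = Prefetcher()
misses = [0, 100000, 64, 100008, 128, 100016, 192, 100024]
print([prefetcher.on_miss(m) for m in misses])
```

The interleaved input contains two strided streams; each is routed to its own filter, and prefetches begin only once a filter's previous prediction has been confirmed.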
The algorithm uses a particle with constant velocity model for its internal Kalman filter
model. This model is designed to effectively recognize and predict strided memory
access patterns, which are present in many typical scientific and multimedia benchmark
applications.
The algorithm was first evaluated for accuracy and effectiveness using a trace-driven
simulation in which the algorithm is performed over a series of pre-generated
cache-miss streams for a broad set of benchmark programs. The Alpha 21264 instruction set
architecture and the PISA instruction set architecture were both simulated because the
Alpha 21264 instruction set architecture contains ISA support for prefetching
instructions whereas the PISA architecture does not. On average, the algorithm
performed better for the PISA architecture than the Alpha architecture, with the PISA
architecture achieving an 81.0% accuracy rate and a 40.9% coverage rate when using
32 Kalman filters. The Alpha architecture saw a 75.7% accuracy rate and a 29.4%
overall coverage for the same algorithm parameters.
The algorithm is also simulated using a modified SimpleScalar design to determine if
the algorithm is effective in reducing execution time. The Alpha benchmark programs
at the Alpha's compiler optimization level four showed an average speedup of 6.5%, or
an average of 17.7% of the speedup an ideal prefetcher would supply, and the PISA
architecture at compiler optimization level three saw an average speedup of 5.6%, or
17.8% of the speedup an ideal prefetcher would supply. It is also observed that for
increasing compiler optimization levels, the algorithm shows a decreasing speedup for
the Alpha ISA, but an increasing speedup for the PISA ISA.
The algorithm presented in this work is in fact feasible because it does show that it is
capable of achieving a speedup in execution time, but it may not be practical for
implementation with its current design, as shown by the performance results. There
are many areas of research that may be conducted to improve the performance of the
algorithm, and it has been designed in a general enough manner that enhancements
should be easy to integrate into the existing design.
Future Directions
Chapter 7 enumerated many possible areas of the algorithm that could be improved
upon to increase the performance of the algorithm with respect to hardware
prefetching. The general infrastructure presented in this work could also be extended
to areas beyond simple hardware prefetching as well. Load value prediction is the
technique of predicting the values retrieved by load instructions before the load
actually executes. Lipasti et al. [37] propose a mechanism for load value prediction that
uses the address of the load instruction as the input, and maintains tables of previous
load values and confidence counters. It then speculatively executes instructions based
on predicted load value outcomes, which are committed if the load value is determined
to be correct. Many programs exhibit a high frequency of loads that continually load
values that are constant for the execution, and load value prediction is a promising area
of research. Burtscher and Zorn [38] extend the load value predictor by using
prediction outcome history information for determining whether or not to issue a
prediction, instead of the saturating counters used by other schemes. Both of these
schemes, however, make use of stored tables of values.
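The table-based approach referred to above can be illustrated with a last-value predictor guarded by a saturating confidence counter. This sketch is a generic illustration and is not the exact mechanism of [37] or [38]: the unbounded dictionary in place of a fixed-size table, the counter limits, and the names are all assumptions.

```python
class LastValuePredictor:
    """Last-value load predictor with a saturating confidence counter per load."""

    def __init__(self, threshold=2, max_count=3):
        self.threshold = threshold  # minimum confidence needed to predict
        self.max_count = max_count  # saturation limit of the counter
        self.table = {}             # load instruction address -> [last value, counter]

    def predict(self, pc):
        """Return the predicted load value, or None if confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def update(self, pc, value):
        """Record the value the load actually returned."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [value, 0]
        elif entry[0] == value:
            entry[1] = min(self.max_count, entry[1] + 1)
        else:
            entry[0] = value
            entry[1] = max(0, entry[1] - 1)


predictor = LastValuePredictor()
for _ in range(3):
    predictor.update(0x400, 7)
print(predictor.predict(0x400))
```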
The infrastructure presented in this work could be applied to load value prediction as
well, either as the confidence mechanism or as the value prediction mechanism itself.
The algorithm as it works now is based on the fact that the value it is predicting is
changing, e.g. miss reference addresses. Load value prediction typically works by
predicting that a load will result in the same value it did in a previous load. However,
this algorithm could easily be applied as the confidence mechanism to determine if a
load value prediction should be used.
Bibliography
[1] J. Collins, S. Sair, B. Calder, and D. M. Tullsen. Pointer Cache Assisted Prefetching.
In Proceedings of the 35th International Symposium on Microarchitecture, December 2002.
[2] Z. Hu, M. Martonosi, S. Kaxiras. TCP: Tag Correlating Prefetchers. In Proceedings of
the Ninth International Symposium on High-Performance Computer Architecture (HPCA
'03), pages 317-326, February 2003.
[3] V. Srinivasan, E. S. Davidson, G. S. Tyson, M. J. Charney, and T. R. Puzak. Branch
History Guided Instruction Prefetching. In Proceedings of the 7th Int'l Conference on
High Performance Computer Architecture (HPCA), pages 291-300, Monterey, Mexico,
January 2001.
[4] D. Joseph and D. Grunwald. Prefetching using Markov Predictors. IEEE
Transactions on Computers, 48(2):121-133, 1999.
[5] Y. Zhang, S. Haga, and R. Barua. Execution History Guided Instruction
Prefetching. To Appear, Journal of Supercomputing, Kluwer Academic
Publishers, 2003.
[6] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. In
Transactions of the ASME Journal of Basic Engineering, 82 (Series D): 35-45, March
1960.
[7] G. Welch, G. Bishop. An Introduction to the Kalman Filter. WWW
http://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf. Accessed August
2003.
[8] P. S. Maybeck. Stochastic Models, Estimation, and Control, Vol 1. Academic Press,
Inc, 1979.
[9] K. P. Murphy. Switching Kalman Filters. In Compaq Cambridge Research Lab Tech
Report 98-10. August 1998.
[10] H. Govindarajalu, A. Rengachari, A. Omondi. DSTRIDE: Data-cache Miss-
address-based Stride Prefetching Scheme for Multimedia Processors. In
Australasian Computer Architecture Conference, Gold Coast, Australia, January 2001.
[11] S. P. E. Corporation. The SPEC benchmark suites. http://www.spec.org/.
[12] J. Tse and A. J. Smith. CPU Cache Prefetching: Timing Evaluation of Hardware
Implementations. In IEEE Transactions on Computers, 47(5):509-526, May 1998.
[13] J. Nwagbaraocha. Discrete Kalman Filter System User's Guide, May 2003.
[14] W. A. Wong, J. Baer. The Impact of Timeliness for Hardware-based Prefetching
from Main Memory. 2002.
[15] D. C. Burger, T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical
Report CS-TR-97-1342, Univ. of Wisconsin-Madison, June 1997.
[16] A. Roth, A. Moshovos, and G. S. Sohi. Dependence Based Prefetching for Linked
Data Structures. In 8th International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 115-126, October 1998.
[17] M. Kämpe, P. Stenstrom, and M. Dubois. The FAB Predictor: Using Fourier Analysis
to Predict the Outcome of Conditional Branches. In Proceedings of the 8th
International Symposium on High-Performance Computer Architecture, pages 223-232.
IEEE Computer Society, Feb. 2002.
[18] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and
M. L. Scott. Energy-Efficient Processor Design Using Multiple Clock Domains
with Dynamic Voltage and Frequency Scaling. In Proceedings of the 8th International
Symposium on High-Performance Computer Architecture, pages 29-40. IEEE Computer
Society, Feb. 2002.
[19] T. C. Mowry, M. S. Lam, and A. Gupta. Design and Evaluation of a Compiler
Algorithm for Prefetching. In 5th International Conference on Architectural Support
for Programming Languages and Operating Systems, pages 62-73, October 1992.
http://www-2.cs.cmu.edu/~tcm/mowry92/tech.html
[20] S. Przybylski. The Performance Impact of Block Sizes and Fetch Strategies. In
Proceedings of the 17th Annual International Symposium on Computer Architecture, pages
160-169, May 1990.
[21] F. Dahlgren, M. Dubois, and P. Stenstrom. Sequential Hardware Prefetching in
Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed
Systems, 6(7):733-746, July 1995.
[22] N. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a
Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th
Annual International Symposium on Computer Architecture, pages 364-373, May
1990.
[23] T. Chen and J. Baer. Reducing memory latency via non-blocking and prefetching
caches. In Proceedings of the 5th Int. Conf. Architectural Support for Programming
Languages and Operating Systems, pp. 51-61, Oct. 1992.
[24] Y. Solihin, J. Lee, and J. Torrellas. Using a User-Level Memory Thread for
Correlation Prefetching. In Proc. 29th Annual Intl. Symp. on Computer Architecture,
May 2002.
[25] G. Semeraro, D. H. Albonesi, S. G. Dropsho, G. Magklis, S. Dwarkadas and M.
L. Scott. Dynamic Frequency and Voltage Control for a Multiple Clock
Domain Microarchitecture. In Proceedings of the 35th Annual International
Symposium on Microarchitecture (MICRO-35), November, 2002.
[26] R. D. Turney, A. M. Reza, J. G. R. Delva. FPGA Implementation of Adaptive
Temporal Kalman Filter for Real Time Video Filtering. Core Solutions Group,
Xilinx, March 1999.
http://www.xilinx.com/products/logicore/dsp/temporal_kalman_fltr.pdf
[27] S. Sair, T. Sherwood and B. Calder. Quantifying Load Stream Behavior. In the
Proceedings of the 8th International Symposium on High-Performance Computer Architecture,
February 2002, Boston, MA.
[28] D. F. Zucker, M. J. Flynn, and R. B. Lee. A comparison of hardware prefetching
techniques for multimedia benchmarks. In Proceedings of the International Conference
on Multimedia Computing and Systems, Hiroshima, Japan, June 1996, pp. 236-244.
[29] The Mathworks, Inc. Matlab, 2004.
http://www.mathworks.com/products/matlab/
[30] Semiconductor Industry Association. International Technology Roadmap for
Semiconductors. 2003. http://public.itrs.net/Files/2003ITRS/Home2003.htm
[31] D. A. Patterson, J. L. Hennessy. Computer Architecture A Quantitative
Approach 2nd edition. Morgan Kaufmann Publishers, Inc. 1996. pp. 373-483.
[32] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren. Supporting dynamic data
structures on distributed memory machines. In ACM Trans. on Programming
Languages and Systems, 17(2), March 1995.
[33] C. Lee et al. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and
Communications Systems. In MICRO-30, 1997.
[34] Digital Equipment Corp. Compiler Writer's Guide for the Alpha 21264, June
1999. http://ftp.digital.com/pub/Digital/info
[35] J. Fu, J. Patel, B. Janssens. Stride Directed Prefetching in Scalar Processors. In
Proceedings of the 25th Annual International Symposium on Microarchitecture
(MICRO-25), November, 1992.
[36] G. Semeraro. Configuration to Match the Alpha 21264 as Closely as Possible.
Alpha21264.cfg. May, 2001.
[37] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen. Value Locality and Load Value
Prediction. In Proceedings of the 7th Annual Conference on Architectural Support for
Programming Languages and Operating Systems, Oct. 1996, pp. 138-149.
[38] M. Burtscher and B. G. Zorn. Prediction Outcome History-based Confidence
Estimation for Load Value Prediction. In the Journal of Instruction Level Parallelism,
May 1999.
