Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy by Doshi, Sheela A
Louisiana State University
LSU Digital Commons
LSU Master's Theses Graduate School
2006
Compiler Assisted Cache Prefetch Using
Procedure Call Hierarchy
Sheela A. Doshi
Louisiana State University and Agricultural and Mechanical College, sdoshi1@lsu.edu
This Thesis is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU
Master's Theses by an authorized graduate school editor of LSU Digital Commons. For more information, please contact gradetd@lsu.edu.
Recommended Citation
Doshi, Sheela A., "Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy" (2006). LSU Master's Theses. 3385.
https://digitalcommons.lsu.edu/gradschool_theses/3385
COMPILER ASSISTED CACHE PREFETCH USING
PROCEDURE CALL HIERARCHY
A Thesis
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
in
The Department of Electrical and Computer Engineering
by
Sheela A Doshi
Bachelor of Technology in Computer Science and Engineering
Jawaharlal Nehru Technological University, Kakinada, 2002
May 2006
Dedicated To
My Parents & My Dearest Sister, Karishma
Acknowledgements
I would like to express my gratitude to my advisor, Dr. David Koppelman for his guidance, and
constant motivation towards the completion of this thesis. He introduced the thesis topic and
helped me understand and modify RSIML to use in my work. His technical advice and suggestions
helped me to overcome hurdles and kept me enthusiastic and made this work a wonderful learning
experience.
I would like to thank my committee members Dr. J. Trahan and Dr. Vaidyanathan, for taking
time out of their busy schedule and agreeing to be a part of my committee. I would like to also
thank them for their valuable feedback.
I would like to thank Dr. Kundu, from the Department of Computer Science, the faculty
members and Shirley and Tonya of the Department of Electrical Engineering, for all the support
and making my study at Louisiana State University a pleasant experience.
I would like to thank my parents and sister without whom I would not have made it to
this point. I would like to thank Prabod Mama, Saru Masi and Vidhu Masi for their love and
encouragement. I would like to thank my roommates & friends here at LSU and back home for all
the help, love and unending support.
Table of Contents
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Existing Prefetch Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Prefetch Using Procedure Call Hierarchy . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 FUNDAMENTAL CONCEPTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Memory Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Memory Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Sequential Access Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Stride Access Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Irregular or Arbitrary Access Pattern . . . . . . . . . . . . . . . . . . . . . 9
2.3 Linked Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Linked Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Classification of Loads Accessing a LDS . . . . . . . . . . . . . . . . . . 10
2.3.3 Locality in LDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3.1 Spatial Locality in LDS . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3.2 Temporal Locality in LDS . . . . . . . . . . . . . . . . . . . . . 11
2.4 Prefetch Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Metrics to Evaluate a Data Prefetch Mechanism . . . . . . . . . . . . . . . 11
2.4.2 Approaches To Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.3 Prefetch Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Software Prefetch Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Thread-based Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 SPAID Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Hardware Prefetch Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Sequential Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Stride Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Prefetch Using Stream Buffers . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.4 Prefetch Using Markov Predictor . . . . . . . . . . . . . . . . . . . . . . 21
3.2.5 Prefetch by Dependence Graph Precomputation . . . . . . . . . . . . . . . 22
3.3 Hardware/Software Prefetch for Dynamic Data Structures . . . . . . . . . . . . . . 23
3.3.1 Greedy Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 History-Pointer Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Data-Linearization Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.4 Prefetch Using Jump Pointers and Prefetch Arrays . . . . . . . . . . . . . 24
3.3.5 Pointer-Cache Assisted Prefetch . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.6 Compiler Directed Content Aware Prefetching . . . . . . . . . . . . . . . 26
4 COMPILER ASSISTED CACHE PREFETCH USING PROCEDURE CALL HIERARCHY . . 27
4.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Directive Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Function Information Table . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.3 Function Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.4 Memory Instruction Queue . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.5 Data Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.1 Example Using Health Benchmark . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1.1 Procedure Call Example . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1.2 Store Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1.3 Storing An Input Argument . . . . . . . . . . . . . . . . . . . . 34
4.2.1.4 Storing A Pointer Variable Assigned During Program Execution 35
4.2.1.5 Traversing a LDS . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1.6 Prefetching for Recursive Functions . . . . . . . . . . . . . . . 37
4.3 Data Prefetch Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1 RSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.2 RSIML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.2.1 New API for Hardware Prefetching . . . . . . . . . . . . . . . . 44
5.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 PERFORMANCE EVALUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.1 Individual Load Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.1.1 Conventional System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.1.2 Effect Of Varying Cache Size . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1.3 Effect Of Varying Latency . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.4 Timeliness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
APPENDIX: INPUT DATA FOR CAPPH . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
List of Tables
4.1 Select Code With Corresponding Directive Table Entries . . . . . . . . . . . . . . . . 32
4.2 Function Information Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1 em3d Troublesome Load Statistics On a Conventional System . . . . . . . . . . . . . 46
6.2 health Troublesome Load Statistics On a Conventional System . . . . . . . . . . . . . 47
6.3 mst Troublesome Load Statistics On a Conventional System . . . . . . . . . . . . . . 47
6.4 em3d Troublesome Load Statistics For Varying L1 Cache Size . . . . . . . . . . . . . 49
6.5 health Troublesome Load Statistics For Varying L1 Cache Size . . . . . . . . . . . . . 50
6.6 mst Troublesome Load Statistics For Varying L1 Cache Size . . . . . . . . . . . . . . 51
6.7 em3d Troublesome Load Statistics For Varying Memory Latency . . . . . . . . . . . . 52
6.8 health Troublesome Load Statistics For Varying Memory Latency . . . . . . . . . . . 53
6.9 mst Troublesome Load Statistics For Varying Memory Latency . . . . . . . . . . . . . 54
7.1 Prefetch Directives for mst. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2 Prefetch Directives for em3d. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.3 Prefetch Directives for health. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
List of Figures
4.1 CAPPH Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Synchronization Problem Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 PSE Output Showing Procedure Call Directive For dealwithargs . . . . . . . . . . . . 33
4.4 PSE Output Showing Store Directive For Input Variable . . . . . . . . . . . . . . . . . 34
4.5 PSE Output Showing Store Directive For Pointer Variable . . . . . . . . . . . . . . . . 35
4.6 Traversing A Linked Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.7 Tree Data Structure Used In Health . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.1 Variation of Accuracy with Memory Latency . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Variation of Accuracy with L1 Cache Size . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Variation of Accuracy with Cache Line Size . . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Variation of Coverage with Memory Latency . . . . . . . . . . . . . . . . . . . . . . 57
6.5 Variation of Coverage with L1 Cache Size . . . . . . . . . . . . . . . . . . . . . . . . 58
6.6 Variation of Coverage with Cache Line Size . . . . . . . . . . . . . . . . . . . . . . . 58
6.7 Variation of Timeliness with Memory Latency . . . . . . . . . . . . . . . . . . . . . . 59
Abstract
Microprocessor performance has been increasing at an exponential rate while memory system
performance has improved at a linear rate. This widening performance gap is increasingly
rendering advances in computer architecture less useful, as more instructions spend more time
waiting for data to be fetched from memory after a cache miss. Data prefetching is a technique
that avoids some cache misses by bringing data into the cache before it is actually needed.
Different approaches to data prefetching have been developed; however, existing prefetch schemes do
not eliminate all cache misses, and even with a smaller cache miss ratio, miss latency remains an
important performance limiter.
In this thesis, we propose a technique called Compiler Assisted Cache Prefetch Using Pro-
cedure Call Hierarchy (CAPPH). It is a hardware-software prefetch technique that uses a compiler
to provide information pertaining to data structure layout, data-flow and procedure-call hierarchy
of the program to a mechanism that prefetches linked data structures (LDS). It can prefetch data for
procedures even before they are called by using this statically generated information. It is also ca-
pable of issuing prefetches for recursive functions that access LDS and arbitrary access sequences
which are otherwise difficult to prefetch.
The scheme is simulated using RSIML, a SPARC v8 simulator. Benchmarks em3d, health
and mst from the Olden suite were used. The scheme was compared with an otherwise identical
system with no prefetch and one using sequential prefetch.
Simulations were performed to measure CAPPH performance and the decrease in the miss
ratio of loads accessing LDS. Statistics of individual loads were collected, and accuracy, coverage
and timeliness were measured against varying cache size and latency. Results from individual loads
accessing linked data structures show considerable decrease in their miss ratios and average access
times. CAPPH is found to be more accurate than sequential prefetch. The coverage and timeliness
are lower in CAPPH than in sequential prefetch. We suggest heuristics to further enhance the





1.1 Background
The decades-long increase in processor performance is the direct result of dramatic breakthroughs
in IC fabrication technology and ideas from computer engineering, which together have made it
possible for processors to execute more than one instruction per cycle and exploit instruction-level
parallelism. Ironically, as microprocessor performance increases exponentially, processor utilization
drastically decreases. Though limited instruction-level parallelism and branch prediction accuracy
play a role, a major reason for this underutilization is that the memory system has not been
able to keep up with processor speeds, and so instructions spend a considerable portion of their run
time waiting for memory requests.
Several ways in which the memory subsystem can still keep pace with processor perfor-
mance have been proposed. Cache memory was introduced as a cost-effective way to store small
chunks of data that might be needed by the application closer to the processor. A cache is a small-
capacity memory that is comparable in speed with the microprocessor. As the processor executes
code, it checks the cache first whenever it needs data. If the data is found in the cache, it is
immediately provided to the processor; otherwise the processor requests that it be fetched from the
larger and slower main memory. Because the access time for data found in the cache is much less
than that for data found in the main memory, it effectively reduces average memory latency for
programs that exhibit a high degree of locality in their addressing patterns [17, 6]. Caches are a




In a conventional system, an on-demand memory fetch policy brings a line into the cache
from the main memory only after there is a request for data that is not present in the cache. This
results in poor processor utilization, as instructions wait for the memory request to be satisfied
[17]. Although a large number of loads are satisfied by data present in the cache, the remaining
loads, which miss the cache, can limit instruction-level parallelism and dominate execution time
despite being fewer in number. The situation is exacerbated when the cache miss ratio increases, as
when applications exhibit little reference locality.
Memory performance for programs with good spatial locality can be improved by making
the cache lines long. However, some code suffers with longer cache lines, and so the line size
is set to a compromise value that limits the slowdown in programs with low spatial locality. This
slowdown is due to cache pollution: long lines evict necessary data from the cache even though
only a small fraction of each long line is likely to hold useful data. This reduces
the space for necessary data and increases both the number of cache misses and the memory traffic
needed to fetch that data again.
In multiprocessor environments, long cache lines also increase the probability of false sharing,
which occurs when two or more processors want to access different words within the same cache
line, and one of the accesses is a write access. The cache hardware cannot give the
processors concurrent access to the separate words they request, because the words belong to the
same line, one of the accesses is a write (data dependency and data coherence constraints must be
maintained), and the smallest memory unit the hardware can operate on is a cache line. The hardware
also generates cache coherence traffic to ensure that changes made to the line are visible to all
processors caching that line, although this is unnecessary, as only the processor executing the
write access references the written word [17].
As more data can be stored in larger caches, the frequency of cache misses, for programs
with good locality, can be reduced by increasing the cache size. However, large caches are expen-
sive, draw a lot of power and reduce the yield of good chips in manufacturing.
Prefetching is a technique that avoids some cache misses without increasing cache size. A
prefetch mechanism generates prefetch requests to bring data into the cache before it is actually
needed, thus overlapping data access with pre-miss computations. It demands higher bandwidth to
sustain the increased frequency of memory requests issued by the processor, which results from some
processor stall cycles being avoided. Unfortunately, the bandwidth of the memory system has to be
larger than the improved performance alone would indicate, to accommodate additional memory traffic due
to unnecessary prefetches issued for data that is not really needed by the processor, as the prefetch
technique may not be 100% accurate. Prefetching is nevertheless practicable, as it is easier to build
higher-bandwidth systems than low-latency systems, and prefetching uses memory bandwidth to bridge the
processor-memory performance gap.
1.2 Existing Prefetch Schemes
Each program exhibits a variety of memory access patterns. Each prefetch technique, as elaborated
on in Chapter 3, works well with particular memory reference patterns. For example, in sequential
prefetch, when a line is referenced, the line next to it is prefetched. In stride prefetch, a reference to
a line triggers prefetches of lines that lie at multiples of a constant distance from the referenced line.
Sequential patterns occur to a very large degree in almost all applications, and sequential prefetch is
successful at eliminating a large number of cache misses. Stride prefetch is useful in programs that
use arrays intensively. Nevertheless, many programs still suffer a large number of cache misses on
systems using sequential or stride prefetch, most of which are caused by memory references
with low temporal and spatial locality.
Many modern applications use dynamic memory allocation extensively and generate linked
data structures (LDS). They are characterized by irregular memory access sequences, which make
it difficult to accurately predict the addresses to be prefetched sufficiently early to mask the large
cache miss latency. An aggravated version of this problem, referred to as the pointer-chasing problem,
is becoming increasingly common in applications. It describes a situation in which a substantial
number of loads are traversing LDS links. These loads are hard to predict a priori and can adversely
affect performance if data is not found in the cache.
Many hardware and software schemes have been proposed to prefetch linked data structures.
For example, history-pointer prefetch memorizes an address sequence the first time it is
accessed by the program, and prefetches it when the program accesses the memory locations again
in the same order [13]. In greedy prefetch, when a node (refer to Chapter 2) is referenced, all
nodes linked to it are prefetched, under the assumption that one of them will be accessed next [13].
However, these techniques have limited ability to eliminate avoidable cache misses.
Thus, cache miss latency remains an important performance limiter in LDS-using programs, even
with existing prefetch schemes.
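As a rough illustration of the greedy policy described above, the following C sketch issues a prefetch hint for every node linked to the node being visited. It is a software rendering under stated assumptions, not the implementation from [13]: the TreeNode type and the function names are hypothetical, and the hint relies on the GCC/Clang builtin __builtin_prefetch, falling back to a no-op on other compilers.

```c
#include <stddef.h>

/* Hypothetical binary-tree node used only for this illustration. */
typedef struct TreeNode {
    int value;
    struct TreeNode *child[2];
} TreeNode;

/* Hint that *addr will be read soon; a no-op where the builtin is absent. */
static void prefetch_hint(const void *addr)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr);
#else
    (void)addr;
#endif
}

/* Greedy prefetch: when a node is referenced, request every node linked
   to it. Returns the number of prefetch hints issued. */
static size_t greedy_prefetch(const TreeNode *node)
{
    size_t issued = 0;
    for (size_t i = 0; i < 2; i++) {
        if (node->child[i] != NULL) {
            prefetch_hint(node->child[i]);
            issued++;
        }
    }
    return issued;
}

/* Traversal that applies the greedy policy at each visited node. */
static long sum_tree(const TreeNode *node)
{
    if (node == NULL)
        return 0;
    greedy_prefetch(node);
    return node->value + sum_tree(node->child[0]) + sum_tree(node->child[1]);
}
```

Note that the hints are issued only one link ahead of the traversal, so there is little computation with which to overlap the miss latency of the child nodes.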
In 2003, Sukhni et al. proposed “Compiler Directed Content Aware Data Prefetching for
Dynamic Data Structures” (CDCAP) [2] to prefetch the nodes of LDS with the help of
compiler-generated information. In this method, the compiler inserts hints into the load instructions
that fetch the base address of a LDS, and writes hints about the data structure layout and the memory
access patterns that the following loads will exhibit to the object file. When a load instruction with
a hint is executed, the hint is passed on to the prefetch hardware, which accordingly issues prefetch
requests for nodes that will be accessed next. This technique is accurate, as it knows where to find
the addresses that will be needed next and can issue prefetches for them. However, before it can
start prefetching, it has to wait for the load instruction with the hint to arrive and for the first
node of the LDS, whose access might result in a cache miss, to be fetched.
1.3 Prefetch Using Procedure Call Hierarchy
In this thesis, we propose “Compiler Assisted Cache Prefetch using Procedure Call Hierarchy”
(CAPPH), which is an extension of CDCAP. As with CDCAP, the compiler is used to gather
information pertaining to data structure layout. Unlike CDCAP, CAPPH also uses program behavior, in terms of
loops, data flow in procedures and the procedure call hierarchy of the program, to create
prefetch directives. The data needed therein is obtained by static data-flow analysis of the program.
The directives are placed into hardware tables when the program is loaded and are executed by the
prefetch hardware in advance of the load instruction they target. This mechanism works well for
prefetching LDS and is capable of prefetching data for a function even before it is called. It is also
successful in prefetching data for recursive functions which are otherwise difficult to prefetch.
1.4 Thesis Organization
The remainder of this thesis has been organized as follows: Chapter 2 provides an overview of
basic concepts of caches and data prefetch. Chapter 3 discusses prior work in data prefetching and
Chapter 4 describes the proposed CAPPH mechanism. Chapter 5 provides simulator features and
specifications of the system modeled by it. Chapter 6 presents experimental results evaluating
the proposed scheme. Finally, in Chapter 7, we conclude this thesis with a summary of the results




2.1 Memory Concepts
A modern memory system consists of one or more cache layers, followed by main memory. The
execution of a load instruction can result in a read access to the memory system. This is also called
a memory reference. A memory system with a two-level cache will check for the data in the level
one (L1) cache; if it is not present, the level two (L2) cache is checked, and if it is not present
there either, main memory is checked. The procedure is similar for systems with a different number
of layers. Transfer between the layers is done in units called cache lines.
If the data is found in the L1 cache, the read access is said to result in an L1 hit; if not found,
it is said to result in an L1 miss. Similarly, if the data is found in the L2 cache, the read
access is said to result in an L2 hit; otherwise it is said to result in an L2 miss. Hit and miss refer to the
presence or absence of the data in the cache when a memory access starts.
The read access for a load instruction starts when the address is offered by the CPU to the
memory system, and it completes when data arrives at the CPU; its access latency is the time from
start to completion. The time from when a request arrives at the memory until the
data is sent to the cache is called memory latency. When requested data is in the L1 cache, the time
required to bring it into the processor is called L1 hit latency. L2 latency and main memory latency
are defined similarly. Data requested from memory is supplied fastest to the processor when it is
found in the L1 cache, followed by the L2 cache and main memory.
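The lookup order and the latency definitions above can be sketched in C as follows. This is only an illustration under stated assumptions: the type names, the function names and any latency values supplied by a caller are hypothetical and are not taken from the system simulated in this thesis.

```c
#include <stdbool.h>

/* The level of the memory system that finally services a read access. */
typedef enum { SERVED_BY_L1, SERVED_BY_L2, SERVED_BY_MEMORY } ServiceLevel;

/* Hypothetical latencies, in processor cycles, for each level. */
typedef struct {
    unsigned l1_hit_latency;
    unsigned l2_latency;
    unsigned main_memory_latency;
} Latencies;

/* Check L1 first, then L2, and fall back to main memory. */
static ServiceLevel service_level(bool in_l1, bool in_l2)
{
    if (in_l1)
        return SERVED_BY_L1;      /* L1 hit */
    if (in_l2)
        return SERVED_BY_L2;      /* L1 miss, L2 hit */
    return SERVED_BY_MEMORY;      /* L1 miss and L2 miss */
}

/* Latency seen by the load, given the level that serviced it. */
static unsigned access_latency(ServiceLevel level, const Latencies *lat)
{
    switch (level) {
    case SERVED_BY_L1: return lat->l1_hit_latency;
    case SERVED_BY_L2: return lat->l2_latency;
    default:           return lat->main_memory_latency;
    }
}
```

For example, with assumed latencies of 1, 10 and 100 cycles, a load that misses in L1 and hits in L2 would see a latency of 10 cycles.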
The cache is made of low-latency, high-cost memory. Parts of it are on the same chip
as the CPU and it is comparable to the processor in speed. It is used to store small chunks of
data frequently accessed by the processor. The main memory is made of high-latency, low-cost
devices. Its access time is very large compared to the processor cycle time. So when the data is
being brought from the main memory, many processor cycles are wasted as instructions wait on
data dependencies.
Caches improve performance by reducing processor stall cycles and so increase the rate at
which instructions execute. They work because a large number of programs exhibit good locality
in their memory access patterns. Spatial locality refers to the likelihood of a line being accessed if
a line near it was just accessed. Temporal locality is the likelihood that a line accessed at one
point in time will be accessed again in the near future.
2.2 Memory Access Patterns
The pattern exhibited in the memory addresses consecutively accessed by a load instruction or a
group of load instructions is called its memory access pattern.
2.2.1 Sequential Access Pattern
An address sequence is considered sequential for purposes of prefetch if the addresses normalized
to the beginning of a line are sequential. An address is normalized by setting the low l bits to zero
when the line size is 2^l bytes. The sequential access pattern is common in almost all programs, not
only in programs that access arrays intensively.
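The normalization step can be written as a single mask operation. The sketch below is illustrative only: the function names are hypothetical, addresses are assumed to be 32 bits wide, and l is assumed to be smaller than 32.

```c
#include <stdint.h>

/* Clear the low l bits of an address, giving the address of the first
   byte of the cache line that contains it (line size = 2^l bytes). */
static uint32_t normalize_address(uint32_t addr, unsigned l)
{
    return addr & ~((UINT32_C(1) << l) - 1u);
}

/* Two accesses are sequential, for prefetch purposes, when the second
   falls in the line immediately following the line of the first. */
static int is_next_line(uint32_t prev, uint32_t next, unsigned l)
{
    return normalize_address(next, l) ==
           normalize_address(prev, l) + (UINT32_C(1) << l);
}
```

With 32-byte lines (l = 5), for instance, the addresses 0x1234 and 0x1238 both normalize to 0x1220, so they lie in the same line, whereas 0x1240 lies in the next line.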
A load is said to have a sequential access pattern if it sequentially accesses contiguous memory
locations. The memory access sequence 0x5bf03, 0x5bf03, 0x5bf04, 0x5bf04 is an example
of a sequential access pattern.
Consider the following code sample in which all array elements, which are stored in con-
tiguous memory locations, are accessed sequentially within a loop.
Integer i, A[N]
For i = 1 to N
... = A[i]
End
Suppose the address of the first element of the array is A and the size of each element (in this
case an integer) is four bytes; then the address sequence accessed by the code is sequential and is as
follows:
A, A + 4, A + 8, A + 12, . . . A + (i− 1)× 4 . . . A + (N − 1)× 4
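The same sequence can be checked with a small C sketch. The function names are hypothetical, and int32_t is used so that each element is exactly four bytes, as assumed in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset, from the start of array a, of the element read in
   iteration i (1-based, as in the text): (i - 1) * 4. */
static size_t sequential_offset(const int32_t *a, size_t i)
{
    return (size_t)((const char *)&a[i - 1] - (const char *)a);
}

/* Read every element in order; the loads touch A, A + 4, A + 8, ... */
static int64_t sum_sequential(const int32_t *a, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 1; i <= n; i++)
        sum += a[i - 1];
    return sum;
}
```

Because consecutive elements are only four bytes apart, several of them share one cache line, and the normalized addresses therefore advance by at most one line at a time.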
2.2.2 Stride Access Pattern
A load is said to exhibit a stride access pattern if consecutive references access lines that are
equidistant from each other. For example, the access sequence 0x10000, 0x10100, 0x10200, and
so on, when the addresses are normalized, constitutes a stride pattern.
This pattern describes a situation in which addresses accessed in consecutive references
differ by a constant amount called the stride. Consider the following code sample:
Integer i, A[M][N]
For i = 1 to M
... = A[i][1]
End
The above code shows an access to the first element of each row of a two-dimensional array
inside a loop. Two-dimensional arrays are often stored in contiguous memory locations with one
row’s elements following the other. Suppose that the address of the first element of the array is A
and the size of each element is four bytes; then the size of each row is 4 × N and the address sequence
accessed by the code is as follows:
A, A + (4N), A + 2 × (4N), . . . A + (i − 1) × (4N), . . . A + (M − 1) × (4N)
This pattern is common in code accessing multidimensional arrays and is easy to predict.
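A C sketch of the same access, assuming row-major storage and four-byte elements, is given below. The dimensions M = 3 and N = 5 and the function names are hypothetical and are chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { M = 3, N = 5 };   /* hypothetical array dimensions */

/* Byte offset of the first element of row i (1-based, as in the text)
   from the start of the array: (i - 1) * 4N. */
static size_t row_start_offset(int32_t a[M][N], size_t i)
{
    return (size_t)((const char *)&a[i - 1][0] - (const char *)&a[0][0]);
}

/* Read the first element of each row; consecutive loads are 4N bytes apart. */
static int64_t sum_first_column(int32_t a[M][N])
{
    int64_t sum = 0;
    for (size_t i = 1; i <= M; i++)
        sum += a[i - 1][0];
    return sum;
}
```

With these dimensions the stride is 4N = 20 bytes, so the loads are issued at offsets 0, 20 and 40 from A.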
2.2.3 Irregular or Arbitrary Access Pattern
A load is said to have an arbitrary access pattern if the address sequence does not follow a simple
pattern. These patterns commonly occur in pointer-intensive programs, that is, code accessing linked data
structures (LDS). Consider the following code sample, which traverses a singly-linked list:
Node *head, *p
p = head /* p now points to the first node of the LDS */
While p != NULL
... = p->data
...
p = p->next /* p moves to the next node of the LDS */
End
In the above code sample, p points to a node in a linked list, and is advanced to the next
node each iteration. Let x be a memory address, define [x] to be the contents of memory at
address x, and suppose the member “next” is 12 bytes from the beginning of the Node structure;
then the address sequence of the above code is as follows:
[p], [p + 12], [[p + 12]], [[p + 12] + 12], [[[p + 12] + 12]], . . .
As seen in the above sequence, the address of each node depends on the address of the
previous node, in that each address that is accessed next is read from the current node. These
nodes need not be in contiguous memory locations, as they may be allocated space at different
times during program execution and the LDS may have been modified by insertion and deletion
of nodes prior to the execution of the above loop. Hence the addresses that will be accessed each
time p (or a node) is accessed form an arbitrary sequence that is infeasible to predict.
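The dependence of each address on the previous node can be made concrete with the following C sketch, which records the address of every load performed by the traversal. The Node layout and the function name are hypothetical. The offset of 12 bytes used in the text assumes a particular layout with 4-byte pointers; the sketch uses offsetof instead, so it holds for whatever layout the host compiler chooses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node; "data" sits at offset 0 and "next" follows
   some filler fields. */
typedef struct Node {
    int32_t data;
    int32_t payload[2];
    struct Node *next;
} Node;

/* Walk the list and record the address of each load: first the data
   field at p, then the link at p + offsetof(Node, next). The next node
   address is known only after the link has been read. Returns the
   number of addresses written, at most cap. */
static size_t trace_traversal(const Node *head, uintptr_t *trace, size_t cap)
{
    size_t n = 0;
    for (const Node *p = head; p != NULL; p = p->next) {
        if (n + 2 > cap)
            break;
        trace[n++] = (uintptr_t)&p->data;
        trace[n++] = (uintptr_t)&p->next;
    }
    return n;
}
```

Since the nodes may be located anywhere in memory, the recorded addresses show no fixed stride from one node to the next.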
2.3 Linked Data Structures
2.3.1 Linked Data Structure
A data structure is a way of storing related data in memory, so that algorithms can manipulate it
efficiently as a unit. When the memory requirement of a program is not known at the beginning of
program execution, memory is allocated as and when it is needed. Call each piece of memory that
is allocated for an LDS element a node. Often these nodes are linked to each other by storing
the address of one (or more) node(s) in another. The structure thus created is called a linked data
structure (LDS).
2.3.2 Classification of Loads Accessing a LDS
Loads accessing a LDS may be classified as follows [2]:
Traversal Loads: Loads that sequentially traverse the nodes of a LDS.
Pointer Loads: Loads that access values within the node that will be subsequently used as an
address for a load/store instruction.
Direct Loads: Loads that read values within the node and use them for computation operations
rather than as an address in future load/store instructions.
Indirect Loads: Loads which use the result of a pointer load as the address and access data from
pointers other than the primary linked traversal pointer.
2.3.3 Locality in LDS
2.3.3.1 Spatial Locality in LDS
LDS are created dynamically, with memory being allocated to different nodes at different times,
and may also be modified during program execution. These reasons cause consecutive nodes of
the LDS to have little spatial locality.
2.3.3.2 Temporal Locality in LDS
LDS may have low temporal locality as the lines containing the nodes may be evicted from the
cache while the program traverses other nodes before coming back to the previously accessed ones
again [13].
2.4 Prefetch Concepts
Data prefetch is a technique to hide memory latency by bringing data into the cache before the
processor needs it, thus overlapping data access with pre-miss computations. When data is demand
fetched, as in a conventional system, a line is brought into the cache from the main memory only
after there is a request for data that is not present in the cache. This results in poor processor
utilization, as instructions wait for the memory request to be satisfied. This can be avoided if the
address that will be accessed next is determined and prefetched. Prefetch concepts in this section
are discussed with regard to uniprocessor environments only.
2.4.1 Metrics to Evaluate a Data Prefetch Mechanism
The two most important parameters that quantify the benefit of a prefetch mechanism are its
accuracy and coverage.
Coverage is defined as the ratio of the number of used prefetches to the number of misses
on an otherwise identical system without prefetching. It is the percentage of misses that were
avoided by the prefetched data. A high coverage value implies that a large number of loads that
would otherwise miss were serviced with prefetched data.
Accuracy is defined as the ratio of the number of prefetched lines that were accessed to the
total number of lines that were prefetched. It is the percentage of time prefetched addresses are
accessed by the program before being evicted.
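The two ratios defined above can be sketched in a few lines of Python. This is only an illustration: the function names and the counts are hypothetical and are not taken from the simulator used in this work.

```python
def coverage(used_prefetches, misses_without_prefetch):
    """Used prefetches divided by the misses of an identical system without prefetching."""
    return used_prefetches / misses_without_prefetch


def accuracy(used_prefetches, total_prefetches):
    """Prefetched lines that were accessed divided by all lines that were prefetched."""
    return used_prefetches / total_prefetches


# Hypothetical counts: the baseline system incurs 1000 misses; the prefetcher
# issues 800 prefetches, of which 600 are accessed before being evicted.
cov = coverage(600, 1000)
acc = accuracy(600, 800)
```

With these assumed counts, 60% of the baseline misses are covered and 75% of the prefetched lines are used; the remaining 25% of prefetches consume bandwidth without benefit.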
Low accuracy indicates that most prefetches were issued for addresses that were not ac-
cessed by the processor. When the accuracy of a prefetch mechanism is low, memory bandwidth
is wasted as unwanted data is brought into the cache. High accuracy with low coverage provides
poor performance enhancement as the prefetch mechanism issues prefetches mostly for correct
addresses but ones which were already cached.
Timeliness is another parameter used to evaluate a prefetch scheme. A prefetch is said to be
timely if it was brought into the cache just when it was needed. While discussing timeliness, we
categorize prefetches in the same manner as in CDCAP [2]. A prefetch is called timely
if data arrived into the cache within one-quarter of the memory latency. It is called good if
data arrived within one-half of the memory latency. Prefetches for which data arrived within
three-quarters of the memory latency are called acceptable, and those for which data arrived earlier
or later than three-quarters of the memory latency are called poor.
In other words, timeliness is a measure of the amount of time a load instruction had to
wait before data was brought into the cache. It is best to have prefetches scheduled such that data
arrives just in time for the load instruction that needs it, so that execution is not slowed. The
improvements in performance are low if data is brought into the cache too early or too late, as
low timeliness increases the likelihood of either a damaged prefetch or cache pollution. It is not
possible to accurately predict when to schedule a prefetch so that data arrives in the cache at the
moment it will be requested by the processor as memory latency and time to execute a set of
instructions vary in different runs.
Cache pollution and memory bandwidth requirement are side-effects of data prefetching
that also have to be considered. Cache pollution is when a prefetched line evicts data that will be
accessed before the prefetched data, thereby resulting in a cache miss which could be avoided if
prefetching were turned off. However, if the prefetched line was accessed before the evicted data,
it would be considered as an ordinary replacement miss rather than cache pollution as it would
occur with or without prefetching.
Bandwidth requirement is the average number of memory requests (demand fetches and
prefetches) issued by both the program and the prefetch scheme in unit time, multiplied
by the size of the data transferred for each request. Prefetching demands higher bandwidth to
sustain the increased frequency of memory requests issued by the processor resulting from some
processor stall cycles being avoided. Unfortunately, the bandwidth of the memory system has to be
larger than indicated by the improved performance, to accommodate additional memory traffic due
to unnecessary prefetches issued for data that is not really needed by the processor, as the prefetch
technique may not be 100% accurate.
2.4.2 Approaches To Prefetch
A data prefetch scheme may use only hardware, only software, or both to achieve prefetching.
Software prefetching relies on the programmer or compiler to insert prefetch instructions based on
the program’s statically determined memory access patterns. These schemes incur processor overhead
due to significant code expansion from the prefetch instructions and any instructions
needed to compute their addresses. However, most difficulty lies in scheduling
prefetches early enough to hide considerable memory latency and achieve a performance gain
that more than compensates for the processor overhead.
Prefetch instructions can be either binding, meaning the fetched data is stored in or is bound
to a register, or non-binding, meaning the fetched data is brought into a memory layer closer to the
processor but not assigned to a register. The prefetch scheme can choose to place prefetched data
in either L1 or L2 cache or a separate buffer for non-binding prefetches.
Hardware prefetch techniques dynamically predict addresses and issue prefetches for them.
They rely on the program’s runtime behavior and look for patterns, say in previous memory access
sequences, data brought into the cache, cache miss sequences, etc., to predict addresses. This can
result in a lot of unwanted data being prefetched, and in significant increases in bandwidth requirement
and likelihood of cache pollution. However, these schemes tend to be more timely than software
methods, as addresses can be predicted sufficiently before they are referenced by the processor.
2.4.3 Prefetch Results
The outcome of a prefetch instruction can be classified as follows. A prefetched line that arrives
in the cache before it is accessed is a good prefetch, as the memory latency is completely masked.
When a prefetched line arrives after the load accessing it arrives, it is called a late prefetch, and the
observed memory latency is reduced in this case. Sometimes prefetched lines are never accessed,
as the prefetch technique may not be 100% accurate. Such prefetches are called unused prefetches
and contribute to cache pollution and wastage of memory bandwidth. The situation is worse when
a prefetched line evicts a line that will be accessed. Such prefetches increase memory latency
rather than hiding it and are called damaging prefetches.
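The four outcomes above can be expressed as a small decision function. The following Python sketch is illustrative only: the function name, the time stamps, and the order in which the cases are tested are assumptions, and a line arriving in the same cycle as its first access is treated as good.

```python
def classify_prefetch(arrival, first_access, evicted_needed_line=False):
    """Classify one prefetch by its outcome.

    arrival             -- time at which the prefetched line reached the cache
    first_access        -- time of the first load to that line, or None if never accessed
    evicted_needed_line -- True if the prefetch evicted a line that was accessed later
    """
    if evicted_needed_line:
        return "damaging"   # assumed to take precedence over the other outcomes
    if first_access is None:
        return "unused"
    if arrival <= first_access:
        return "good"       # latency completely masked
    return "late"           # latency only partly hidden


outcomes = [
    classify_prefetch(arrival=100, first_access=140),
    classify_prefetch(arrival=100, first_access=80),
    classify_prefetch(arrival=100, first_access=None),
    classify_prefetch(arrival=100, first_access=None, evicted_needed_line=True),
]
```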
Chapter 3
Literature Review
Caches started appearing in the 1960s [1], and though they went a long way in reducing average
memory access latency, average memory access time was still a significant part of execution time,
and by 1978 investigators started reporting on the first prefetch schemes [15]. These schemes relied
on predicting the address that will be accessed next based on the current address reference
pattern, and spatial locality in programs allowed for easy address prediction (see Section 2).
Fetching lines larger than the size of referenced data is effectively prefetching, as the sur-
rounding data is brought into the cache before it is referenced. The frequency of misses in pro-
grams that exhibit good spatial locality can be reduced by making cache lines long as more data
is prefetched. However, when cache lines are brought on demand into the cache, there will still
be a miss for each such subsequent prefetch (when referenced data belongs to an uncached line),
and true prefetch schemes avoid these misses requiring only one or two misses under favorable
circumstances.
Hardware prefetching of separate cache lines was implemented in the IBM 370/168 and
Amdahl 470V in 1975 [17]. The idea of software prefetch was mentioned by Smith [15] in 1978,
but it was not realized until 1989, when Porterfield proposed the idea of a “cache load instruc-
tion” [14]. Such instructions, now called prefetch instructions, were later implemented in several
instruction-set architectures and are common in new instruction-set architectures. Non-blocking
loads are an example of binding prefetch instructions, in that they are issued in advance of the
data’s actual use and load data into a register, and instead of predicting the address they make use
of the available parallelism to compute them.
Researchers have since then vigorously pursued prefetch techniques to reduce or hide memory
latency and improve performance. Prefetch methods can be hardware or software based, or use
a combined hardware/software approach. It will become evident in the following discussion that
as different prefetch techniques use different approaches to predict prefetch addresses and differ-
ent applications have different runtime behaviors and data access patterns, different applications
do better with different prefetch techniques and it is not possible to achieve the best results for all
applications with any one particular prefetch scheme.
Non-blocking or lockup-free caches, which can tolerate multiple outstanding misses and
allow execution to proceed while the instruction that actually needs the data waits, thereby hiding
memory latency by overlapping data access with post-miss instructions, were proposed by Kroft [11]
in 1981. These caches are sensitive to increases in memory latency as they require sufficient
non-blocking distance, the distance between a load instruction and the first instruction dependent on it, to
completely mask memory latency. Compiler techniques such as dynamic instruction scheduling,
to increase the distance between a load instruction and the first instruction dependent on it, and
static register renaming and out-of-order execution (with a large reorder buffer), to uncover more
instruction-level parallelism and allow greater freedom in rescheduling instructions, were being
investigated to hide some miss latency in non-blocking caches [6]. Such caches typically cover
L1 miss/L2 hit latency in modern dynamically scheduled systems but cannot cover the latency of
accesses that miss the entire cache.
In 1992, Chen and Baer studied the effectiveness of non-blocking and prefetching caches
and proposed a hybrid design in which a prefetch hint is provided to the memory subsystem prior
to the load instruction and the binding of the loaded value to a register is delayed until the value
is actually used [6]. They concluded that prefetching caches performed better than non-blocking
caches and were less sensitive to memory latency, and asserted that a pure hardware approach could
outperform software-based prefetch techniques as it can be more aggressive and timely.
3.1 Software Prefetch Techniques
Software prefetching relies on the programmer or compiler to insert prefetch instructions based
on the program’s statically determined memory access patterns. These instructions must be non-blocking
(allow execution to proceed until the actual instruction that needs the data arrives) and
therefore need non-blocking caches. Prefetch instructions are intended only to enhance perfor-
mance and do not affect program correctness, hence they are not allowed to raise exceptions that
an ordinary load would for a given address to avoid unnecessary overhead. For instance, the pro-
gram need not be terminated if the prefetch address is invalid, and even if it is merely wrong, the
program need not wait for a page fault to be serviced when the needed page is swapped in. As
a compiler is used to reliably predict memory access patterns, software prefetch schemes attempt
to benefit from code transformations mainly in loops. These are useful because effective prefetch
instructions can be easily found for loops accessing large arrays for which addressing patterns are
easy to predict; such loops are common in many scientific applications.
These schemes do not require hardware to predict addresses, but incur processor overhead
due to significant code expansion from instructions computing addresses and issuing
prefetches. However, most difficulty lies in scheduling prefetches early enough to hide considerable
memory latency and achieve a performance gain that more than compensates for the processor
overhead. This is difficult because the prefetch instructions are statically generated and so addresses
cannot be determined sufficiently in advance, the time between the prefetch and the instruction
needing the prefetched data and the memory latency vary unpredictably during execution,
and the scheme is unable to detect when a prefetched block has been prematurely evicted and needs to be
prefetched again. Hence, improvement is limited mainly to array-based numeric applications, and these
schemes do not fare very well in pointer-based applications even after using compiler optimizations such
as instruction scheduling, loop unrolling, etc. Moreover, the occurrence of a cache miss depends
on many runtime factors (which are visible to hardware), such as input data, cache size, temporal
locality, etc., and even when the scheme is 100% accurate, prefetches issued for data already in the
cache are unnecessary and the benefit of prefetching is diminished [17].
3.1.1 Thread-based Prefetch
Simultaneous multithreading can be used to hide the latency of a cache miss in one thread by
executing the instructions of another thread. Multithreaded processors with fast context-switching
can achieve a similar benefit. This approach does not require the ability to predict data addresses
ahead of time, but requires significant hardware to minimize thread-switching overhead if the number
of threads is large. The technique is only effective when more than one thread is active at the time of the
cache miss, which is not the case for many programs.
Multithreaded prefetchers execute code in another thread context, attempting to bring data
into shared cache before the primary thread accesses it. These prefetchers are accurate as they
use code from the actual instruction stream to compute load addresses. However, they have a
potential shortcoming of cache misses preventing the thread from going ahead of the main thread,
as it is difficult to write multithreaded programs that can hide cache misses, due to control and
data dependencies in the code, while prefetching has the potential to accelerate a single thread
of execution [7, 13]. Some thread-based prefetchers include Execution-based Prediction Using
Speculative Slices [19] and Data Prefetching by Dependence Graph Precomputation [3], but these
are not conventional multithreading or SMT and run on custom systems.
3.1.2 SPAID Heuristic
In 1995, Lipasti et al. proposed “Speculatively Prefetching Anticipated Interprocedural Dereferences”
(SPAID), a compile-time heuristic that utilizes prefetch instructions in pointer- and call-intensive
applications. It inserts prefetch instructions at call sites for data referenced by pointers
passed as arguments on procedure calls. This heuristic provides little improvement to applications
which have fewer call sites due to longer procedures, do not pass pointers as arguments, or
are characterized such that prefetches are mostly issued for data that is already in the cache. Results
indicate that inserting multiple prefetches per call site gives little improvement over inserting a
single prefetch, as more of the prefetches tend to be issued for
cached data or for pointers that will not be dereferenced by the program [12].
3.2 Hardware Prefetch Techniques
Hardware prefetch techniques do not have compiler support to determine which addresses will be
needed next, instead using hardware to dynamically predict addresses and issue prefetches for
them. They rely on the program’s runtime behavior and look for patterns, say in previous memory
access sequences, data brought into the cache, cache miss sequences, etc., to predict addresses.
For instance, they can be designed to detect constant strides in the access sequence of a load
instruction and issue prefetches for addresses next in the sequence, as in the stride prefetcher.
As there is no support from the compiler, code compatibility is maintained and processor overhead
can be avoided, at the cost of extra hardware resources, as separate cache ports can be dedicated
to prefetch rather than the processor issuing requests to the memory subsystem as in software
prefetch.
These schemes are based on prediction rather than knowledge about the program’s future
access sequences, and so are not as selective while issuing prefetches to the memory subsystem.
This can result in decreased accuracy, as a lot of unwanted data is prefetched, and significant
increases in bandwidth requirement and likelihood of cache pollution. Also, it is difficult for
hardware to predict arbitrary access patterns, like ones which occur with LDS. These cannot be
predicted from just the reference stream, and accuracy also generally drops when running too
far ahead of the program, as the prediction then relies on a pattern that does not exist. However,
these schemes tend to be more timely than software methods, as addresses can be predicted
sufficiently before they are referenced by the processor [17].
3.2.1 Sequential Prefetch
This is the simplest form of hardware prefetching in which memory access to a cache line triggers
prefetch for lines adjacent to it, thereby taking advantage of spatial locality of references. For
example, consider a code which accesses the contiguous elements of an array. When it accesses
the array’s first element, elements around it are prefetched and brought into the cache before they
are needed by the code. As the code proceeds and accesses the array sequentially, more elements
are brought into the cache and so can be referenced by the code without resulting in cache misses.
This method helps to avoid a large number of misses, even in programs that are not array-intensive,
because almost all programs exhibit good spatial locality in memory reference patterns.
Sequential prefetch schemes can be categorized depending on the type of event that triggers
prefetch. For example, tagged prefetch or prefetch-on-reference schemes associate a tag bit with
each line to determine when it is first accessed, and a prefetch for line l + 1 is issued when line l
is accessed for the first time. The line l may either be a demand-fetched or a prefetched line. In
prefetch-on-miss schemes, line l + 1 is prefetched when an access to line l results in a cache miss.
There are many variations of this technique, with differences in the number of lines prefetched.
The number of lines that are prefetched at a time is called the degree of prefetching. This is varied
during runtime in adaptive sequential prefetch depending on the amount of spatial locality exhibited
by the program, estimated using a prefetch efficiency measure computed dynamically as the
ratio of useful prefetches to total prefetches. An example is one-block lookahead, which initiates
a prefetch for line l + 1 when line l is accessed.
One-block lookahead tagged prefetch produces better results than one-block lookahead
prefetch-on-miss due to algorithm behavior and though adaptive prefetching can achieve lower
cache miss ratios than one-block lookahead tagged prefetch, improvement was found to be nulli-
fied by increased memory traffic and contention [17].
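The difference between the two one-block lookahead triggers can be seen in a small simulation. The Python sketch below is a simplified model under stated assumptions: the cache never evicts, prefetched lines arrive instantly, and the names count_misses, "miss", and "tagged" are hypothetical.

```python
def count_misses(accesses, policy):
    """Count demand misses for one-block lookahead with a 'miss' or 'tagged' trigger."""
    cache = set()      # resident lines (no eviction in this sketch)
    untouched = set()  # prefetched lines whose tag bit is still set (not yet referenced)
    misses = 0
    for line in accesses:
        if line not in cache:
            misses += 1
            cache.add(line)
            trigger = True                  # both policies prefetch on a demand miss
        elif line in untouched:
            untouched.discard(line)         # first reference to a prefetched line
            trigger = (policy == "tagged")  # only tagged prefetch reacts to it
        else:
            trigger = False
        if trigger and line + 1 not in cache:
            cache.add(line + 1)             # prefetch line l + 1
            untouched.add(line + 1)
    return misses


sequential = list(range(8))                       # lines 0..7 accessed in order
misses_on_miss = count_misses(sequential, "miss")
misses_tagged = count_misses(sequential, "tagged")
```

For this purely sequential stream, prefetch-on-miss still misses on every other line, while tagged prefetch misses only on the first line, which is consistent with the comparison reported above.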
Sequential prefetch is different from doubling the cache block size in that the prefetched
lines are treated separately with regard to cache replacement and coherence policies. Smaller line
sizes are better as there is more room for actively used data and the probability of false sharing,
the event when two or more processors wish to access different words within the same cache line
and at least one of them is a write, is reduced [17].
3.2.2 Stride Prefetch
Sequential prefetch does little good when programs access data in a stride pattern, a pattern in
which addresses increase by a fixed amount larger than the line size, the stride. For example,
the access sequence of a load may be 0x10100, 0x10200, 0x10300,... and so on. Clearly, such
sequences are not hard to predict. In 1991, Chen and Baer proposed a mechanism which monitors
the address reference pattern of different load instructions to detect constant stride array references
that might originate from looping structures. Once a pattern is established, prefetches are issued
to the address predicted to be the next in line. This method needs some warm-up time to detect
strides before it can issue prefetches, unlike software prefetch schemes, and it may take several
iterations for it to achieve a prefetch distance that completely hides memory latency [17].
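A stride detector of this kind can be sketched as a table indexed by the address of the load instruction. The Python code below is a deliberately simplified illustration and not the exact state machine of the proposed mechanism: it assumes a stride is confirmed once the same non-zero stride is observed twice in a row, and the class name and the load address 0x400 are hypothetical.

```python
class StridePrefetcher:
    def __init__(self):
        # load PC -> (last address, last stride, stride confirmed?)
        self.table = {}

    def access(self, pc, addr):
        """Record one load; return the address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, last_stride, _ = self.table[pc]
        stride = addr - last_addr
        confirmed = stride != 0 and stride == last_stride
        self.table[pc] = (addr, stride, confirmed)
        return addr + stride if confirmed else None


pf = StridePrefetcher()
predictions = [pf.access(0x400, a) for a in (0x10100, 0x10200, 0x10300, 0x10400)]
```

The first two accesses produce no prefetch, which corresponds to the warm-up time mentioned above; from the third access on, the next address in the sequence is predicted.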
3.2.3 Prefetch Using Stream Buffers
Hardware schemes typically fetch a lot of unwanted data into the cache, many times replacing data
that is needed by the processor. In 1990, Jouppi proposed using separate buffers, stream buffers, to
store a stream of prefetched sequential cache lines, independent of program context. In the event
of a cache miss, the stream buffers are looked up for an entry that matches the miss address. If one
is found, the corresponding stream buffer is allocated, the data is taken into the cache and the following lines
in the stream buffer are moved up in the queue. Then, prefetches are launched for the sequentially
consecutive cache lines to be brought into the buffer. If data is not found at the head of the allocated
buffer (even if it is present somewhere else in it), it is flushed, a new stream buffer is allocated in
LRU order and, again, prefetches are launched for the stream. This scheme avoids cache pollution
as data is not placed directly into the cache. However, lines must be fetched in the order they are
accessed for the stream buffer to work, because if data is not found at the head of a buffer after a
cache miss, the buffer is flushed and data is prefetched again [17].
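The head-only lookup and the flush on a mismatch can be illustrated with a single buffer. The Python sketch below makes several simplifying assumptions: there is only one stream buffer (so no LRU allocation), its depth is four lines, prefetches complete instantly, and every access passed to it is already a cache miss. The class and method names are hypothetical.

```python
from collections import deque


class StreamBuffer:
    def __init__(self, depth=4):
        self.depth = depth
        self.queue = deque()   # prefetched line numbers, head at the left
        self.next_line = None  # next sequential line to prefetch

    def _refill(self):
        # Launch prefetches for consecutive lines until the buffer is full.
        while len(self.queue) < self.depth:
            self.queue.append(self.next_line)
            self.next_line += 1

    def on_cache_miss(self, line):
        """Return True if the miss is served from the head of the buffer."""
        hit = bool(self.queue) and self.queue[0] == line
        if hit:
            self.queue.popleft()       # the line moves into the cache
        else:
            self.queue.clear()         # flush, then restart the stream after the miss
            self.next_line = line + 1
        self._refill()
        return hit


sb = StreamBuffer()
results = [sb.on_cache_miss(line) for line in (10, 11, 12, 14, 15)]
```

The miss to line 14 is not served even though line 14 is present in the buffer at that point, because only the head entry (line 13) is compared; the buffer is flushed and the stream restarts at line 15.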
3.2.4 Prefetch Using Markov Predictor
Proposed by Joseph and Grunwald, this hardware prefetch scheme uses a Markov predictor as an
interface between the on-chip and off-chip caches to prefetch multiple reference predictions based on
an assigned priority. The Markov predictor may use any of several sources of prediction. The
address reference stream can be used to predict the prefetch address, but it requires the predictor
hardware to be placed on the same chip as the processor and requires the hardware to be very effi-
cient, as it may have to analyze many addresses in one cycle, and large tables to store information
about every address accessed by the program. Alternatively, the miss address stream is used as it
is visible to the off-chip cache and is less frequent. This is important as the prefetcher may need to
access large tables to be effective.
As addresses miss in the on-chip cache, a directed graph is created with the miss addresses
as the nodes and the miss reference sequence as the edges. The edge from node X to node Y is
assigned a weight equal to the fraction of all missed references to X that were followed by missed
references to Y. For example, if misses to address 0x10100 were followed by misses to 0x10200
three times and 0x10300 once, then the edge from 0x10100 to 0x10200 will be labeled three-fourths
and the edge from 0x10100 to 0x10300 one-fourth, as the total misses to 0x10100 are four. If the program
executes again and issues the same memory sequence, the miss pattern will repeat and the predictor
can issue prefetches for them. For instance, if there is a miss to 0x10100, the predictor will prefetch
0x10200 first, followed by 0x10300. The prefetched data is placed in a separate prefetch buffer so
as to not disturb the cache contents, and the miss address is used to index into the prediction table
and provide the next set of possible addresses that previously followed this miss address. However,
the Markov predictor is encumbered with the following drawbacks: programs may not repeat the
same address pattern while re-executing; each node can have an arbitrary degree, with different
missing addresses following it, so it may be difficult to predict accurately; and programs access
many addresses, so it may be difficult to record all references in a table. Also, after one set of
addresses is prefetched the hardware remains idle until there is another miss and does not use the
predicted addresses to continue with prefetching [9].
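The worked example above can be reproduced with a small transition table built from the miss stream. The Python sketch below is an illustration only: the class, its method names, and the particular miss stream are assumptions, and the table is unbounded, which sidesteps the storage problem noted above.

```python
from collections import Counter, defaultdict
from fractions import Fraction


class MarkovPredictor:
    def __init__(self):
        self.successors = defaultdict(Counter)  # miss address -> counts of the following miss
        self.prev = None

    def record_miss(self, addr):
        """Add an edge from the previous miss address to this one."""
        if self.prev is not None:
            self.successors[self.prev][addr] += 1
        self.prev = addr

    def weight(self, x, y):
        """Fraction of recorded misses to x that were followed by a miss to y."""
        total = sum(self.successors[x].values())
        return Fraction(self.successors[x][y], total)

    def predict(self, addr):
        """Addresses that followed addr, highest weight first."""
        return [a for a, _ in self.successors[addr].most_common()]


# Hypothetical miss stream in which 0x10100 is followed by 0x10200 three times
# and by 0x10300 once.
stream = [0x10100, 0x10200, 0x10100, 0x10200,
          0x10100, 0x10300, 0x10100, 0x10200]
mp = MarkovPredictor()
for a in stream:
    mp.record_miss(a)

w_200 = mp.weight(0x10100, 0x10200)
w_300 = mp.weight(0x10100, 0x10300)
order = mp.predict(0x10100)
```

On a later miss to 0x10100 the predictor would issue prefetches in the order given by predict, that is, 0x10200 first and then 0x10300.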
3.2.5 Prefetch by Dependence Graph Precomputation
Unlike the stride and sequential schemes, this scheme executes the actual program and so can prefetch
arbitrary address sequences. It was proposed by Annavaram et al. [3] to prefetch data for
load/store instructions that frequently miss in the cache. When a memory access instruction enters
the instruction fetch queue, a dependence graph of those instructions yet to be executed that determine
the address of the load/store instruction is generated and executed to compute the address.
The fetch queue and reorder buffer delays are avoided to obtain the address early enough to issue
a timely prefetch. One drawback is that dependence graphs must be generated at runtime;
this not only increases hardware complexity, but also hides little latency, as the prefetch
is issued only after the instruction enters the instruction fetch queue and the dependence graph is
constructed and computed. Also, if there is an uncommitted store to the same address as the load,
it will not be detected by the graph generator and will result in an incorrect address. Non-blocking
caches with dynamically scheduled systems can provide the same sort of prefetch, though perhaps
at greater expense.
3.3 Hardware/Software Prefetch for Dynamic Data Structures
As seen in the above discussion, several techniques to achieve prefetching for regular access pat-
terns have been developed. These patterns are easy to predict and the techniques successfully
eliminate a lot of cache misses. However, many cache misses still result from loads having ir-
regular access patterns, a considerable number of which can be attributed to memory accesses to
LDS, and can cause severe performance degradation. Many of the cache misses which occur while
accessing the nodes of a LDS are caused because there is little spatial locality between consec-
utively accessed nodes, as they are allocated at different times during program runtime, and the
temporal locality may be low [13]. Prefetch using a Markov predictor and thread-based prefetch, like
prefetch using dependence graph precomputation, have a chance of prefetching along a LDS with
reasonable accuracy, unlike stride and sequential prefetch, which would have low accuracy and coverage.
As these patterns are characterized with low locality memory references which are difficult
to predict and occur largely in code accessing LDS, more elaborate techniques aimed at prefetch-
ing the nodes of LDS and dealing with the pointer-chasing problem are needed and have been
proposed. Some techniques overlap the latency of the prefetch that fetches the next node in the
LDS, with the work between two consecutive LDS accesses. However, the computation time on
each node between successive accesses is not always sufficient to hide all the memory latency.
Some techniques add pointers between non-successive nodes to launch prefetches aggressively
and hide as much latency as possible even when there is little work in each iteration.
3.3.1 Greedy Prefetch
In this method, whenever a node of a linked data structure is visited, all pointers within that node
to the neighboring nodes are prefetched, hoping that at least one of them will be used next. Though
this scheme is simple and has low runtime overhead, it has the drawbacks of insufficient
prefetch distance and the possible issuance of many unnecessary prefetches.
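The method described above can be sketched for a binary tree. In the Python code below, the Node class, the function name, and the prefetch callback (which stands in for a prefetch instruction) are all hypothetical, and a preorder traversal is assumed.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def visit_with_prefetch(root, prefetch):
    """Preorder traversal that requests a prefetch for every child pointer of each visited node."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        for child in (node.left, node.right):
            if child is not None:
                prefetch(child)       # stands in for a prefetch instruction
        order.append(node.value)      # the "work" done on the node
        stack.append(node.right)      # visit the left subtree first
        stack.append(node.left)
    return order


issued = []
tree = Node(1, Node(2, Node(4)), Node(3))
order = visit_with_prefetch(tree, lambda n: issued.append(n.value))
```

In this example the prefetch for node 2 is issued immediately before node 2 is visited, which illustrates the insufficient prefetch distance, whereas the prefetch for node 3 is issued well before it is needed.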
3.3.2 History-Pointer Prefetch
The access pattern is memorized during the first traversal of the LDS, and prefetches are issued
sequentially to the stored addresses when the mechanism detects that the same addresses are being
accessed again, hoping that they will be accessed next. The drawbacks of this method are the
absence of prefetching the first time the LDS is traversed and that it is useful only when the data access
pattern does not change over subsequent traversals. Luk et al. suggest the use of greedy prefetching
the first time the LDS is traversed.
3.3.3 Data-Linearization Prefetch
The idea is to map heap allocated nodes that are likely to be accessed close together in time into
contiguous memory locations, to increase available spatial locality and be able to predict the ad-
dress to prefetch without any pointer dereferencing. One way to implement this idea is to dynam-
ically remap data after the LDS is constructed. This will incur large runtime overheads. The map
may be generated when the LDS is being created, to reduce overhead, if it is traversed in the same
order in which it is created and will change very little or preferably not at all after creation. If the
LDS changes rapidly, prefetching will not improve performance [13].
3.3.4 Prefetch Using Jump Pointers and Prefetch Arrays
Proposed by Karlsson et al. in 2000, this software method is an extension of the Greedy method
suggested by Luk et al. [13] to hide load latencies of LDS traversal when the traversal path is not
known a priori and there is little computation (insufficient to completely hide the prefetch latency of
the next node) for each node. This scheme uses jump pointers (pointers to nodes away from the
head of the LDS) and prefetch arrays (arrays containing jump pointers) to prefetch, aggressively and
in parallel, nodes that might be accessed in successive iterations. A fixed number of jump pointers may
be associated with each node, and when a node is brought into the cache, the nodes pointed to by
the jump pointers are prefetched. This has the disadvantage that the nodes closer to the head of the
LDS will not be prefetched. To overcome this problem, prefetch arrays are used to prefetch the
first few nodes of the list and the jump pointers are used for prefetching the subsequent nodes; the
prefetch array has pointers to nodes closer to the head of the LDS. In a hardware approach to the
same scheme, the software prefetch instruction passes the address of the first element and the size of
the array to the prefetch engine, which issues prefetches for all the nodes pointed to by the array
and stops when it reaches the end of the array. This method also incurs overhead on insert
and delete operations on the LDS [10].
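The interplay of jump pointers and a prefetch array can be sketched for a singly linked list. The Python code below is an illustration under stated assumptions: each node carries a single jump pointer, the jump distance is fixed at three nodes, and the prefetch array simply holds the first three nodes, which no jump pointer targets. All names are hypothetical.

```python
class ListNode:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.jump = None   # jump pointer: the node `distance` links ahead, if any


def build_list(values, distance):
    """Link the nodes, install jump pointers, and build the prefetch array."""
    nodes = [ListNode(v) for v in values]
    for i, node in enumerate(nodes):
        if i + 1 < len(nodes):
            node.next = nodes[i + 1]
        if i + distance < len(nodes):
            node.jump = nodes[i + distance]
    prefetch_array = nodes[:distance]   # nodes near the head that no jump pointer reaches
    return nodes[0], prefetch_array


def traverse(head, prefetch_array):
    """Return the values of the nodes for which prefetches are issued, in issue order."""
    prefetched = [n.value for n in prefetch_array]   # issued before the traversal starts
    node = head
    while node is not None:
        if node.jump is not None:
            prefetched.append(node.jump.value)       # prefetch `distance` nodes ahead
        node = node.next
    return prefetched


head, array = build_list(list(range(6)), distance=3)
prefetched = traverse(head, array)
```

The jump pointers must be kept consistent whenever nodes are inserted or deleted, which is the maintenance overhead mentioned above; the sketch builds them once and does not model updates.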
3.3.5 Pointer-Cache Assisted Prefetch
This is a more recent technique, proposed by Collins et al. in 2002, targeted at speeding up recurrent
pointer accesses. A pointer cache is used to store mappings between heap pointers and the address
of the heap object they point to, if the address of the pointer and the address of the object it points
to fall within the range of the heap, providing a compressed representation of the important pointer
transitions for the program. Its function is to break the serial dependence chains in pointer-chasing
code. With simple hardware support, a load is identified as a pointer load if it does not point to the
stack (for better performance) and the upper N bits of its effective address match the upper N bits
of the value being loaded [7]. The pointer cache is looked up, with the load’s effective address,
simultaneously with the data cache. If the lookup results in a hit, the pointer cache provides the predicted
value for the load, which might be the address of an object that will be accessed by subsequent loads,
and so a prefetch is issued for it. A prediction error may be handled by either re-executing the instructions
or flushing the pipeline. This method has the advantage that modifications to the LDS can be
handled by simple detection using store teaching and an update in the pointer cache. Also, value
prediction of the address allows multiple loop iterations to be executed in parallel.
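The pointer-load test and the lookup can be sketched as follows. This Python code is a loose illustration and not the design of [7]: the 32-bit address width, the choice N = 8, the unbounded table, the omission of the stack check, and the way a store updates an existing entry are all assumptions, as are the names used.

```python
ADDRESS_BITS = 32   # assumed address width
UPPER_BITS = 8      # assumed value of N


def looks_like_pointer(effective_addr, loaded_value):
    """True if the upper N bits of the load address and of the loaded value match."""
    shift = ADDRESS_BITS - UPPER_BITS
    return (effective_addr >> shift) == (loaded_value >> shift)


class PointerCache:
    def __init__(self):
        self.entries = {}   # address of a pointer -> address of the object it points to

    def on_load(self, effective_addr, loaded_value):
        """Return the predicted value for this load (or None), then train on the real value."""
        predicted = self.entries.get(effective_addr)
        if looks_like_pointer(effective_addr, loaded_value):
            self.entries[effective_addr] = loaded_value
        return predicted

    def on_store(self, addr, value):
        """Assumed form of store teaching: refresh an existing mapping when it is overwritten."""
        if addr in self.entries:
            self.entries[addr] = value


pc = PointerCache()
first = pc.on_load(0x10001000, 0x10002000)    # cold lookup, no prediction yet
second = pc.on_load(0x10001000, 0x10002000)   # predicted from the stored mapping
pc.on_store(0x10001000, 0x10003000)           # the LDS is modified
third = pc.on_load(0x10001000, 0x10003000)    # prediction follows the update
```

A non-None prediction is the address for which a prefetch would be issued before the load itself completes.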
3.3.6 Compiler Directed Content Aware Prefetching
As is evident from the above discussion, several approaches to prefetching LDS have been formulated.
Since pointer address sequences can be totally arbitrary, the only way to determine them is to
either remember them or to execute a condensed version of the code. Memorization, done in part
by the Markov and prefetch array schemes, requires impractically large tables. Prefetch instructions
or prefetch threads may slow down execution when prefetch is not necessary. The advantages of
software’s prefetch accuracy and hardware’s ability to issue timely prefetches are both
achieved in Compiler-Directed Content Aware Prefetching for Dynamic Data Structure (CDCAP),
proposed by Sukhni et al. [2].
CDCAP is a hardware-software technique for prefetching data for linked data structures
(LDS) and arrays of pointers. In this method, the compiler, with the help of profile information,
generates directives, based on knowledge of pointer uses in a structure to prefetch linked data
structures. The directives follow links based on data structure layout information provided by the
compiler. This mechanism uses three types of prefetch directives to inform the hardware about how
to obtain next addresses to prefetch. Each directive has a unique ID. The compiler inserts the ID
of the directive that should be issued, as a hint, into load instructions which fetch the base address
of an LDS. This hint is consumed dynamically by the hardware to issue the appropriate prefetch
instruction. It eliminates the need for prior knowledge of traversals and the need for the hardware to
detect data structure layout dynamically. Compiler-inserted directives are also used to read prefetched
data arriving in the memory subsystem in order to issue further prefetches. In this way CDCAP behaves
like thread-based prefetch techniques. However, as it can be implemented at different levels of the
memory hierarchy, the hardware can issue the next prefetch instruction as soon as a node of the
LDS is brought into a particular level instead of waiting, like in thread-based prefetchers, for the
node to reach the level closest to the processor. Also, it need not consume precious bandwidth at
the higher levels of the memory hierarchy and can reduce and better tolerate cache pollution.
Chapter 4
Compiler Assisted Cache Prefetch Using
Procedure Call Hierarchy
The proposed Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy (CAPPH) scheme
is a hardware-software technique capable of prefetching some arbitrary data access sequences that
other prefetch schemes cannot. It extends the Compiler-Directed Content-Aware Prefetching for
Dynamic Data Structures (CDCAP) scheme proposed by Sukhni et al. [2]. Like CDCAP, it uses the
compiler to provide information on data structure layout. It also uses static data-flow analysis and
knowledge of the procedure-call hierarchy to create prefetch directives for loads that are within
loops and at loop and procedure boundaries. The data-flow analysis provides information
about when memory is allocated to the first node of the LDS and which links of the LDS
are accessed in the different procedures of the program. Using this information, CAPPH can prefetch
data for procedures even before they are called. It is also capable of issuing prefetches for
recursive functions that access LDS and for arbitrary access sequences which are otherwise difficult
to prefetch.
4.1 Data Structures
CAPPH is implemented using three tables, a stack and a queue. The tables are used to store
prefetch directives and data needed to compute addresses. The stack is needed for CAPPH to
handle procedure calls and returns to be able to prefetch data for a procedure even before it is





The directive table (DT) contains prefetch directives of the different functions (used interchangeably
with procedures) of the program. Each entry of this table specifies a directive to be executed by
the hardware prefetch engine and comprises an index field, a PC field, an action field and four
argument fields. A non-zero PC value specifies that the directive must be issued in the context
of the application instruction that has the same PC as the directive, when that instruction is in its
commit stage. If, on the other hand, the PC value of a directive is zero, the directive is not
associated with any instruction of the application program and is to be issued after the completion
of the previously issued directive and after a fixed predetermined interval from the issuance of the
previous directive, whichever is later.
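A directive table entry and the timing rule for PC-zero directives can be written down as a small model. This is a sketch under assumptions, not the RSIML implementation: the `Directive` record, the function `earliest_issue_cycle` and the use of cycle counts as the time unit are hypothetical names and choices introduced here.

```python
from dataclasses import dataclass


@dataclass
class Directive:
    index: int
    pc: int        # 0: not associated with any program instruction
    action: str
    arg1: int = 0
    arg2: int = 0
    arg3: int = 0
    arg4: int = 0


def earliest_issue_cycle(prev_issue, prev_done, interval):
    """Earliest cycle at which a PC-zero directive may issue.

    prev_issue -- cycle in which the previous directive was issued
    prev_done  -- cycle in which the previous directive completed
    interval   -- fixed wait between successive directives
    The directive waits for both events, i.e. for whichever is later.
    """
    return max(prev_done, prev_issue + interval)


# Row 17 of Table 4.1: a direct prefetch that is not tied to an instruction.
entry17 = Directive(17, 0, "d", 32)
```

When the previous prefetch returns quickly the interval dominates, and when it misses all the way to memory the completion time dominates.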
4.1.2 Function Information Table
The function information table (FIT) is looked up each time a directive specifying a procedure
call or procedure return is executed. It is indexed by function-id, a unique identifier assigned to every
function that has prefetch directives in the directive table. The specific ID to look up is
obtained from the directive itself for a procedure call, or from the function stack for a procedure
return. Each entry of this table comprises three fields: function-start and function-end, which specify
the range of entries in the DT that the hardware may look up while issuing directives, and wait-time, the
statically determined minimum period of time the hardware has to wait between issuing successive
directives during the execution of the corresponding function.
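A FIT lookup can be illustrated with the rows that are legible in Table 4.2. The `FitEntry` type and the helper `directive_scope` are hypothetical names; the range is treated as inclusive because entry 10 is the return directive of function 1.

```python
from typing import NamedTuple


class FitEntry(NamedTuple):
    function_start: int
    function_end: int
    wait_time: int


# Rows copied from Table 4.2; the rows elided in the table are omitted here.
FIT = {
    0: FitEntry(0, 6, 1),
    1: FitEntry(7, 10, 0),
    2: FitEntry(11, 19, 6),
    7: FitEntry(123, 125, 0),
}


def directive_scope(function_id):
    """Return the inclusive range of DT indices that belong to a function."""
    entry = FIT[function_id]
    return range(entry.function_start, entry.function_end + 1)
```

For example, `directive_scope(1)` covers DT entries 7 through 10, the directives of dealwithargs.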
4.1.3 Function Stack
The prefetch mechanism uses a separate stack to keep track of where it is in terms of procedure
calls and returns, so that it can prefetch data for procedures well before they are called in the
application. Each time a directive specifying a procedure call is executed, information pertaining
to the current procedure is pushed on to this stack and mechanism variables are assigned new
values from the FIT. Each time a directive specifying a function return is executed, the mechanism
variables are updated with the data popped off this stack.
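The push and pop behaviour can be sketched as below. The text does not list the mechanism variables that are saved, so the choice made here (the function ID and the index of the directive to resume at) is an assumption, as are the class and method names.

```python
class PrefetchState:
    """Mechanism variables plus the separate function stack (assumed layout)."""

    def __init__(self, fit, function_id=0):
        self.fit = fit          # function-id -> (start, end, wait_time)
        self.stack = []
        self._load(function_id)
        self.current = self.start   # first entry becomes the current directive

    def _load(self, function_id):
        self.function_id = function_id
        self.start, self.end, self.wait_time = self.fit[function_id]

    def call(self, callee_id):
        # Save what is needed to resume the caller after the return.
        self.stack.append((self.function_id, self.current + 1))
        self._load(callee_id)
        self.current = self.start

    def ret(self):
        function_id, resume_at = self.stack.pop()
        self._load(function_id)
        self.current = resume_at


fit = {0: (0, 6, 1), 1: (7, 10, 0)}   # two rows of Table 4.2
state = PrefetchState(fit)
state.current = 3                      # entry 3 calls function 1
state.call(1)
inside = (state.current, state.start, state.end, state.wait_time)
state.ret()
```

After the return, the scope and wait time of function 0 are restored from the FIT and the directive following the call becomes current.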
4.1.4 Memory Instruction Queue
Some prefetch directives issue more than one memory reference instruction, each of which is
dependent on the previous. In such cases, the prefetches are pushed into a queue, and issued in
FIFO order, one after the completion of the other. The execution of such prefetch directives is
considered to have come to completion only after all the prefetches issued by it are satisfied.
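The serialization imposed by the queue can be modelled as follows. The sketch assumes a constant memory latency and takes the chain of addresses as given, whereas in hardware each address would come from the data returned by the previous request; `run_directive` is a hypothetical name.

```python
from collections import deque


def run_directive(addresses, latency):
    """Issue dependent prefetches one at a time, in FIFO order.

    Returns (issue_log, completion_cycle). Each request is issued only
    after the previous one has been satisfied, and the directive is
    complete only when the queue has drained.
    """
    queue = deque(addresses)
    now = 0
    issue_log = []
    while queue:
        addr = queue.popleft()
        issue_log.append((addr, now))
        now += latency          # wait for this request to be satisfied
    return issue_log, now


log, done = run_directive([0x100, 0x140, 0x180], latency=100)
```

With three dependent requests and a latency of 100 cycles, the directive completes at cycle 300 in this model.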
4.1.5 Data Table
The data table is used to store data that is collected or computed during program execution.
4.2 Mechanism
Prefetch directives are created based on static analysis of the program. They are passed on to
the hardware at runtime and loaded into tables. The hardware prefetch engine uses DT to store
directives which will be executed by the hardware, and the FIT to store information that delimits
the scope of entries in the DT, from which directives can be found at a given point in program
execution.
Figure 4.1: CAPPH Hardware Overview
When execution starts, the first entry in the DT becomes the current directive, that is, a directive
that can be issued if the following conditions are satisfied. For each machine instruction that commits,
the hardware performs an associative lookup on the program counter (PC), within the scope specified
by the FIT, to determine whether or not a prefetch directive has to be issued. A prefetch directive
is issued if the PC of the committing instruction matches the PC of the directive. Otherwise,
the current directive may be issued if its PC is equal to zero, once the statically determined interval
has elapsed since the last directive was issued and all previously issued memory requests are
complete, whichever is later. Unless determined otherwise by the most recently executed directive,
the directive immediately after it becomes the current directive. As each directive is executed,
prefetching is achieved.
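The per-commit decision can be sketched as a lookup over the PC fields of the directives in scope. The PC values below are those of function 2 in Table 4.1; the function name and the boolean `timer_ok`, which stands for "the interval has elapsed and earlier requests are complete", are assumptions made for this illustration.

```python
# PC field of DT entries 11 to 19 (function 2 in Table 4.1).
DT_PC = {11: 0x10854, 12: 0x10860, 13: 0x108b8, 14: 0x108ec,
         15: 0x108f0, 16: 0, 17: 0, 18: 0, 19: 0x1098c}


def select_directive(dt_pc, scope, current, commit_pc, timer_ok):
    """Return the index of the directive to issue this cycle, or None."""
    start, end = scope
    # Models the associative lookup on PC within the FIT scope.
    for idx in range(start, end + 1):
        if dt_pc[idx] != 0 and dt_pc[idx] == commit_pc:
            return idx
    # Otherwise the current directive may issue if it is free-running.
    if dt_pc[current] == 0 and timer_ok:
        return current
    return None
```

A committing instruction with a matching PC takes priority; a PC-zero current directive issues only when no match occurs and the timing condition holds.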
In order to ensure that the prefetch mechanism does not go too far ahead of the program,
some form of synchronization between them is required. Currently, the PC of committing instructions
and the current window pointer (CWP) are used to serve this purpose, as this is easy to implement
in an initial study of this mechanism. The current window pointer, provided in the SPARC
instruction-set architecture, is used to determine the level in the call hierarchy in which an instance
of a function is executing. CAPPH uses a separate current window pointer, call it CWP*, to keep
track of the level in the call hierarchy it is prefetching for. PC and CWP effectively synchronize
many programs with CAPPH. Synchronization is achieved by allowing directives with a non-zero
PC value to execute only when an instruction with the same PC commits and CWP is equal to
CWP*. This ensures that the directive instance is executed in the context of a matching instruction
instance. CWP is particularly needed to prefetch data for recursive functions so that CAPPH is not
misled by an instruction with the same PC executing at a different level in the call hierarchy.
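The synchronization test itself is a two-part comparison, illustrated below for a recursive function whose instruction at one PC commits at several levels of the call hierarchy. The function name and the window pointer values are made up for the example.

```python
def may_execute(directive_pc, commit_pc, cwp, cwp_star):
    """Synchronization check for a directive with a non-zero PC.

    cwp      -- current window pointer of the committing instruction
    cwp_star -- CAPPH's own window pointer (the level it prefetches for)
    """
    return commit_pc == directive_pc and cwp == cwp_star


# The same instruction commits at three recursion levels while CAPPH is
# prefetching for the level whose window pointer is 4.
commits = [(0x108f0, 5), (0x108f0, 4), (0x108f0, 3)]
matches = [may_execute(0x108f0, pc, cwp, cwp_star=4) for pc, cwp in commits]
```

Only the second commit matches, which is how CWP keeps CAPPH from being misled by the other two instances.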
Another benefit of placing CAPPH in the commit stage is that some directives read register
values of instructions, and the correct values are available only then. However, this simplification
comes with the trade-off of lower coverage, as the hardware sometimes has to wait for an
instruction with a specified PC to commit before it can continue to execute directives. Also, the PC
and CWP are not sufficient for synchronization when a function makes more than one recursive
call, because one cannot tell how many times the function has already called itself at a particular
level at a given point in time.
Figure 4.2: Synchronization Problem Example
As a simple example of the synchronization problem, consider the situation depicted in
figure 4.2. Function A calls function B twice. Call the two instances of B B1 and B2, respectively.
Assume that CAPPH is well ahead of the program and is issuing directives for B1 when
the program is executing A. If CAPPH completes B1 and starts prefetching for B2, and the
program starts executing B1, there is no way for CAPPH to identify that the instance of the function it
is prefetching for is different from the instance executed by the program, as both the PC and CWP
are the same. For this reason, prefetching for B2 has to wait until the processor finishes executing B1.
4.2.1 Example Using Health Benchmark
To illustrate, directives and their use will be shown for health, one of the Olden benchmarks. The
health program is a “Columbian Health Care Simulator” which uses doubly-linked trees to repre-
sent a hierarchy of hospitals and doubly-linked lists at each node (hospital) to represent patients.
Patients arrive periodically at each hospital and might have to wait for a free physician to attend
them. The physician assesses their ailment and either transfers them to the next hospital up the
hierarchy or treats them. The height of the tree and the time the program has to run for are given
as inputs [8].
Table 4.1 shows code from a few functions in the health benchmark and the corresponding
prefetch directives. The first set of directives is for the main function; when execution starts,
the first entry in the DT becomes the current directive. Table 4.2 shows a sample FIT for the DT in
table 4.1.
Table 4.1: Select Code With Corresponding Directive Table Entries
                                                  Directive Table
High-level code                                   ID   PC       Act.  Arg1  Arg2     Arg3  Arg4
Function 0                                        0    0x10078  d     9     1        0     0
int main(int argc, char* argv){
dealwithargs(argc, argv);                         3    0x10c44  f     1     0        0     0
top = alloc_tree(maxlevel, 0, NumNodes,
..);                                              4    0x10c60  f     2     0        0     0
. . .
}                                                 9    0x10d10  e     1     0        0     0
Function 1
void dealwithargs(int argc, char*argv){           7    0x107f4  s     1     2        0     0
max_level = atoi(argv[1]);                        8    0x10800  s     2     2        0     0
. . .                                             9    0x10818  s     3     2        0     0
}                                                 10   0        e     0     0        0     0
Function 2
struct Village *alloctree(int level, , Village
*back){                                           11   0x10854  a     3     0        0     0
if (level == 0) return NULL;                      12   0x10860  s     4     1        1     0
else{                                             13   0x108b8  f     2     0        0     0
new = (struct Village *)ALLOC(lo,
sizeof(struct Village));                          14   0x108ec  f     2     0        0     0
for (i = 3; i > 0; i--)                           15   0x108f0  z     4     0x108f4  0     0
fval[i].value = (unsigned long int
*)alloc_tree(. . . );                             16   0        o     32    24       0     0
. . .                                             17   0        d     32    0        0     0
new->hosp.personnel = (int)pow(2, level -
1);                                               18   0        a     2     -3       0     0
new->hosp.freepersonnel = new->
hosp.personnel;                                   19   0x1098c  e     6     0        0     0




void addList(struct List *list, struct Patient
*patient){                                        123  0        t     62    0        -1    -1
while (list != NULL) {                            124  0        0     0     0        0     0
b = list;                                         125  0x1116c  e     0     0        0     0
list = list->forward;}





Table 4.2: Function Information Table
ID     Function Start   Function End   Wait Time
0      0                6              1
1      7                10             0
2      11               19             6
. . .
7      123              125            0
4.2.1.1 Procedure Call Example
Figure 4.3: PSE Output Showing Procedure Call Directive For dealwithargs
The execution of a procedure call directive can be seen in figure 4.3. As soon as the main function
is entered at PC: 0x10c3c, function dealwithargs is called at PC: 0x10c44. This information
is provided to the hardware by the function-call directive, and the ID of dealwithargs is provided
by arg1 of the directive. The FIT is looked up with this function ID as index, and the range of
directives in the DT which correspond to the function is provided by the function start and function
end values in that entry. The first entry of the called function becomes the current directive. After
this directive is executed, the hardware knows to look up directives between ID seven and ten for
prefetch instructions, and directive seven becomes the current directive.
4.2.1.2 Store Example
The dealwithargs function reads the program’s input arguments. These arguments are stored by
CAPPH in its data array assuming that they might be useful later in program execution to issue
prefetches for data early enough. This is accomplished by the store directive. This directive is
issued in the context of a particular instruction and can read the value of the instruction’s rs1, rs2
or rd registers and, if it is a load/store instruction, the effective address accessed by it.
4.2.1.3 Storing An Input Argument
Figure 4.4: PSE Output Showing Store Directive For Input Variable
Figure 4.4 shows a CAPPH directive storing one of the inputs (maxlevel). dealwithargs
reads the input as a string, converts it into integer format, and stores it in its assigned memory
location. At that time the directive reads the value of the rs1 register.
4.2.1.4 Storing A Pointer Variable Assigned During Program Execution
Figure 4.5: PSE Output Showing Store Directive For Pointer Variable
After dealwithargs returns, the main function calls another function, alloctree. As the name
suggests, alloctree constructs a tree of height maxlevel, with each node having four children. As
all linked data traversals start from the head node, it is sufficient for CAPPH to remember its address
in order to issue prefetches when it knows from static information that the code is going to access
the LDS in the near future. This way it can start by prefetching the very first node and continue to
prefetch the access sequence that will follow. Figure 4.5 shows CAPPH storing the address of the root
node of the tree that is constructed in alloctree.
4.2.1.5 Traversing a LDS
Figure 4.6: Traversing A Linked Data Structure
Function 7 in table 4.1 takes two pointer input parameters, list and patient. As shown
in the table, a while loop accesses the nodes of the LDS whose base address is list, and when it
reaches the end, it allocates memory for another node and inserts patient information. This happens
at every tree node (hospital) when a patient arrives. Before the hardware is issued the function-call
directive for addList, the prefetch directives of the caller compute, or read
from memory, the address of the base node of the LDS and save it in a location that the directives for
addList read from. After the function-call directive for addList is executed, DT entry 123 (the first
directive for addList) becomes the current directive. Information in entry 123 indicates that it is a
traversal directive and that the address of the LDS node is in the sixty-second data array element,
data[62]. Arg3 and arg4 provide offsets from the base address of the node to pointers that will be
accessed in the loop iteration. As arg3 and arg4 are negative one, no pointer loads will be
accessed during LDS traversal. Arg2 provides the offset to where the address of the next node of
the LDS is stored. So after the node with address in data[62] is prefetched, the address of the next
node is read and stored in data[62]. After the directive is executed, it still remains the current
directive and will be issued again after the statically determined interval or after the completion of the
previously issued directive, whichever is later. It continues to remain current until the end of the
LDS is reached or an instruction with another directive associated with it arrives.
4.2.1.6 Prefetching for Recursive Functions
Figure 4.7: Tree Data Structure Used In Health
CAPPH also prefetches data for recursive functions using the recursive prefetch directive.
alloc_tree is a recursive function which dynamically allocates space to nodes of a tree in a
depth-first manner. When it backtracks towards the top of the tree after allocating space to a leaf
node, it visits intermediate nodes and initializes them. These nodes are prefetched by CAPPH. Entry 15
of the DT in table 4.1 is a directive for recursive prefetch. Arg1 indicates that the next arg1−1
directives correspond to recursive prefetch. When the instruction with PC: 0x108f0 commits, it means
that the function at that level is going to return, maybe after some more computation. This directive
checks whether the return PC of the function matches the one in arg2 of the directive. If so, the
function is going to climb up towards the root of the tree, and prefetch for data accessed in the
function instance at the next higher level starts. The information about what data to prefetch is
provided in the next 3 directives, as specified by arg1. If the return PC of the function does not match,
the directive following this information becomes the current directive, in this case entry 19.
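The choice of the next current directive after a recursive prefetch directive reduces to one comparison. In the sketch below the function name is hypothetical, and the value 0x108f4 for arg2 of entry 15 is an assumed reading of the corresponding row of table 4.1.

```python
def next_after_recursive(index, arg1, arg2, return_pc):
    """Index of the directive that becomes current after a 'z' directive.

    On a match the arg1 - 1 directives that describe the recursive
    prefetch follow; otherwise they are skipped.
    """
    if return_pc == arg2:
        return index + 1
    return index + arg1


# Entry 15 with arg1 = 4: directives 16 to 18 describe the prefetch.
taken = next_after_recursive(15, 4, 0x108f4, return_pc=0x108f4)
skipped = next_after_recursive(15, 4, 0x108f4, return_pc=0x10c68)
```

With arg1 equal to 4 the skipped case lands on entry 19, which agrees with the example in the text.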
4.3 Data Prefetch Directives
Following is the description of each directive that is specified in the prefetching scheme. As
mentioned earlier, each directive has seven fields: index, PC, action, arg1, arg2, arg3 and arg4. The
PC field is used to determine whether or not the action has to be taken in the current cycle, that is,
the cycle in which the instruction with the same PC commits, while arg1, arg2, arg3 and arg4 are
parameters for the action to be taken.
Compute ’c’ : Applies the operation selected by arg3 to data[arg2] and either arg4 or data[arg4], and assigns the result to data[arg1].
data[arg1] = data[arg2]× arg4 if arg3 = 1
data[arg1] = data[arg2] + arg4 if arg3 = 2
data[arg1] = data[arg2]  arg4 if arg3 = 3
data[arg1] = data[arg2]  arg4 if arg3 = 4
data[arg1] = data[arg2]/arg4 if arg3 = 5
data[arg1] = data[arg2]× data[arg4] if arg3 = 6
data[arg1] = data[arg2] + data[arg4] if arg3 = 7
data[arg1] = data[arg2]  data[arg4] if arg3 = 8
data[arg1] = data[arg2]  data[arg4] if arg3 = 9
data[arg1] = data[arg2]/data[arg4] if arg3 = 10
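A partial model of the compute directive is sketched below. The operators for codes 3, 4, 8 and 9 are not legible in the list above, so the sketch rejects them rather than guessing; treating division as integer division is also an assumption, as is the function name.

```python
import operator

# Only the operators that are legible in the list: multiply, add, divide.
_OPS = {1: operator.mul, 2: operator.add, 5: operator.floordiv}


def compute(data, arg1, arg2, arg3, arg4):
    """Apply a compute directive to the data table, in place.

    Codes 1-5 use the literal arg4 as the second operand and codes
    6-10 use data[arg4], with the same operator order.
    """
    immediate = arg3 <= 5
    code = arg3 if immediate else arg3 - 5
    if code not in _OPS:
        raise NotImplementedError(f"operator for arg3={arg3} is not known")
    rhs = arg4 if immediate else data[arg4]
    data[arg1] = _OPS[code](data[arg2], rhs)


data = {1: 0, 2: 6, 3: 7}
compute(data, 1, 2, 1, 4)      # data[1] = data[2] * 4
product = data[1]
compute(data, 1, 2, 7, 3)      # data[1] = data[2] + data[3]
total = data[1]
```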
Store ’s’ : Stores the value of a source or destination register of the instruction, depending on the
value of arg2, in the data array.
data[arg1] = [reg_rd] if arg2 = 1
data[arg1] = [reg_rs1] if arg2 = 2
data[arg1] = [reg_rs2] if arg2 = 3
Read ’r’ : Reads the contents of a memory location into the data array. The address of the memory
location is given by the sum of data[arg2] and arg3.




Direct Prefetch ’d’ : Issues a prefetch. data[arg1] is the address that is prefetched.
addr = data[arg1]
prefetch(addr)
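The store and direct prefetch directives are often used as a pair: one records an address when an instruction commits, and the other prefetches it later. The sketch below models both with a dictionary for the data table and a list that records issued prefetch addresses; the register values and function names are invented for the example.

```python
def store(data, arg1, arg2, instr_regs):
    """Store directive: copy one register value of the committing instruction.

    arg2 selects the register: 1 for rd, 2 for rs1, 3 for rs2.
    """
    field = {1: "rd", 2: "rs1", 3: "rs2"}[arg2]
    data[arg1] = instr_regs[field]


def direct_prefetch(data, arg1, issued):
    """Direct prefetch directive: prefetch the address held in data[arg1]."""
    issued.append(data[arg1])


data, issued = {}, []
# Hypothetical committing instruction whose rd holds a node address.
store(data, 4, 1, {"rd": 0x40010, "rs1": 0, "rs2": 0})
direct_prefetch(data, 4, issued)
```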
Indexed Prefetch ’i’ : The address is obtained by computing the sum of data[arg2] and data[arg3].
A prefetch is issued for the address, the contents of the address are stored in data[arg1], and
data[arg3] is incremented by arg4.
addr = data[arg2] + data[arg3]
prefetch(addr)
data[arg1] = [addr]
data[arg3] = data[arg3] + arg4
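Repeated execution of an indexed prefetch walks an array, for instance an array of pointers. The sketch below follows the four pseudo-code steps above; memory is modelled as a dictionary from address to word, and the addresses, slot numbers and 4-byte element size are assumptions.

```python
def indexed_prefetch(data, memory, arg1, arg2, arg3, arg4, issued):
    """One execution of the indexed prefetch directive."""
    addr = data[arg2] + data[arg3]      # base + running index
    issued.append(addr)                 # prefetch(addr)
    data[arg1] = memory[addr]           # keep the word that was fetched
    data[arg3] = data[arg3] + arg4      # advance the index


# An array of two pointers at 0x1000, 4 bytes per element (assumed).
memory = {0x1000: 0x2000, 0x1004: 0x2040}
data = {0: 0, 1: 0x1000, 2: 0}          # data[1] = base, data[2] = index
issued = []
indexed_prefetch(data, memory, 0, 1, 2, 4, issued)
indexed_prefetch(data, memory, 0, 1, 2, 4, issued)
```

The word left in data[arg1] can then serve as the address for a later direct prefetch of the object the array element points to.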
Traversal Prefetch : Issues prefetches for the nodes of a linked data structure. The address of
the first node to be prefetched is provided in data[arg1]. Arg3 and arg4 are offsets from the
node address where addresses which will be accessed as pointer loads within the traversal
loop are present. Arg2 is the offset from the node address where the address of the next node
in the data structure lies.
node_addr = data[arg1]
prefetch(node_addr)
addr = [node_addr + arg3]
prefetch(addr)
addr = [node_addr + arg4]
prefetch(addr)
data[arg1] = [node_addr + arg2]
Nodes may be prefetched till the end of the linked data
structure or a fixed number of nodes may be prefetched depending on the action specified.
• Till the end of the list: ’t’
• Fixed number of nodes: ’u’
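Both variants can be sketched with one step function and a driver loop. The assumptions are labelled in the code: a null link (0) marks the end of the list, an offset of -1 means that there is no pointer load for that slot (as in the addList example, entry 123), and the `limit` parameter stands in for however the node count of ’u’ is supplied, which the text does not specify.

```python
NULL = 0   # assumed end-of-list marker


def traversal_step(data, memory, arg1, arg2, arg3, arg4, issued):
    """One execution of the traversal directive; False at the end of the list."""
    node_addr = data[arg1]
    if node_addr == NULL:
        return False
    issued.append(node_addr)                     # prefetch the node itself
    for offset in (arg3, arg4):
        if offset != -1:                         # -1: no pointer load here
            issued.append(memory[node_addr + offset])
    data[arg1] = memory[node_addr + arg2]        # follow the link
    return True


def traverse(data, memory, args, issued, limit=None):
    """'t' when limit is None, 'u' with a node budget otherwise."""
    count = 0
    while (limit is None or count < limit) and \
            traversal_step(data, memory, *args, issued):
        count += 1
    return count


# Three nodes linked through a pointer at offset 0: 0x100 -> 0x200 -> 0x300.
memory = {0x100: 0x200, 0x200: 0x300, 0x300: NULL}
data = {62: 0x100}
issued = []
visited = traverse(data, memory, (62, 0, -1, -1), issued)
```

Calling `traverse` with `limit=2` on a fresh copy of `data` would stop after two nodes and leave the address of the third in data[62].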
Recursive Prefetch ’z’ : This directive is used to initiate prefetch for recursive functions. Prefetch
for data needed by the procedure at higher level starts when this directive is executed.
Obtain ’o’ : Given the register number and the window number, this directive retrieves the
contents of an architected register. The register number is provided by arg2 and the window
number is provided by arg4 and the contents of the registers are stored in data[arg1] and
data[arg3]. The window number is only needed for instruction-set architectures that have
windowed registers, like SPARC.
data[arg1] = [reg_arg2] in the given win_num
data[arg3] = [reg_arg4] in the given win_num
Move ’m’ : The pointer to the directive table is moved by arg1 or incremented by 1, depending on
whether the condition selected by arg3, computed on data[arg2] and data[arg4], is true or false.
cond = (data[arg2] = data[arg4]) if arg3 = 1
cond = (data[arg2] < data[arg4]) if arg3 = 2
cond = (data[arg2] > data[arg4]) if arg3 = 3
cond = (data[arg2] ≥ data[arg4]) if arg3 = 4
cond = (data[arg2] ≠ data[arg4]) if arg3 = 5
cond = (data[arg2] = branch_direction) if arg3 = 11
dt_ptr = dt_ptr + arg1 if cond = true
dt_ptr = dt_ptr + 1 if cond = false
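The move directive behaves like a conditional branch over the directive table. A minimal sketch follows; the function name and the encoding of the branch direction as an integer passed in by the caller are assumptions.

```python
import operator

# Comparison selected by arg3, following the list above.
_CONDS = {1: operator.eq, 2: operator.lt, 3: operator.gt,
          4: operator.ge, 5: operator.ne}


def move(dt_ptr, data, arg1, arg2, arg3, arg4, branch_direction=None):
    """Return the new directive-table pointer after a move directive."""
    if arg3 == 11:
        cond = data[arg2] == branch_direction
    else:
        cond = _CONDS[arg3](data[arg2], data[arg4])
    return dt_ptr + arg1 if cond else dt_ptr + 1


data = {1: 5, 2: 5, 3: 9}
forward = move(20, data, 4, 1, 1, 2)      # 5 == 5: jump ahead by 4
fallthrough = move(20, data, 4, 1, 2, 2)  # 5 < 5 is false: next entry
backward = move(20, data, -3, 1, 2, 3)    # 5 < 9: jump back by 3
```

A negative arg1 moves the pointer backwards, which is one way a loop over a group of directives could be expressed.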
Assign ’a’ : This directive is used to assign a value to any of the hardware variables, depending on
arg1.
wait_time = arg2 if arg1 = 1
recursive_fn_ptr = recursive_fn_ptr + arg2 if arg1 = 2
recurse = arg2 if arg1 = 3
dt_ptr = dt_ptr + arg2 if arg1 = 4
win_num = din->win_num if arg1 = 5
traverse = arg2 if arg1 = 6
Wipe ’w’ : Deletes all pending requests in the queue.
Function call ’f’ : This directive tells the hardware to start executing the prefetch directives of the
function whose identifier is provided by arg1. When this directive is executed, the current
values of variables pertaining to the calling function are stored in a stack managed by the
prefetch hardware, and new values are assigned such that they correspond to those of the
called function.
Function return ’0’ or ’e’ : When this directive is executed, values of variables are updated to
those obtained by popping the stack and execution continues with values corresponding to
the function being returned to. In the case of ’0’, a condition is checked before confirming
return from the function currently being prefetched. For ’e’, on the other hand, no condition





The simulator used for our study is RSIML, a heavily modified version of RSIM.
5.1.1 RSIM
RSIM, an acronym for Rice Simulator for ILP Multiprocessors, is an open source discrete event-driven
architecture simulator, originally developed at Rice University. “It simulates shared-memory
multiprocessors (and uniprocessors) built from processors that aggressively exploit instruction-level
parallelism (ILP). RSIM is execution-driven and models state-of-the-art ILP processors, an
aggressive memory system, and a multiprocessor coherence protocol and interconnect, including
contention at all resources” [18]. Compared to other publicly available shared-memory simulators,
RSIM is more representative of current and near-future processors. Important features of RSIM
include:
1. Processor simulation features:
• Superscalar - Multiple instruction issue
• Out-of-order (dynamic) scheduling
• Register renaming
• Static and dynamic branch prediction (version 1.0)
• Non-blocking loads and stores
• Speculative load execution before prior stores are disambiguated
• Optimized memory consistency implementations
2. Memory simulation features:
• Two-level cache hierarchy
• Multiported and pipelined L1 cache, pipelined L2 cache
• Multiple outstanding cache requests
• Memory interleaving
• Software-controlled non-binding prefetching
3. Multiprocessor system features:
• CC-NUMA shared-memory system with directory-based coherence
• Support for MSI or MESI cache coherence protocols
• Support for sequential consistency, processor consistency, and release consistency
• Wormhole-routed mesh network
RSIM supports most instructions generated by current C compilers for the UltraSPARC-I
or UltraSPARC-II with Solaris 2.5 or 2.6. It does not support 64-bit integer register instructions
and quadruple-precision floating-point instructions from the SPARC V9 architecture [18].
RSIM ran only on Solaris/SPARC and IRIX/MIPS and could be easily ported to other
big-endian architectures. It was later ported to little-endian GNU/Linux running on x86 systems
by Fernandez et al. at the University of Murcia in Spain. Experimental results indicated that
Linux x86 machines were faster than their more expensive and less freely available Solaris/SPARC
counterparts [23]. Though RSIML runs on Linux/IA-32, it is not based on the port above; RSIML
was ported to Linux/IA-32 here based only on RSIM 1.0.
5.1.2 RSIML
RSIML is RSIM, extended and modified in the Department of Electrical Engineering at Louisiana
State University. Among the other changes made at LSU, the following is fundamental to this
thesis [21, 22].
5.1.2.1 New API for Hardware Prefetching
RSIM only had support for software prefetch. Hardware load prefetching and an API for easily
adding hardware prefetch schemes were included in RSIML. Currently, one of the processor’s ports is
used to issue a prefetch request to the memory subsystem, and prefetching into the L1 and L2 caches
is supported. Also, a limit can be set on the number of outstanding prefetches. Several new prefetch
statistics, reported for all prefetch schemes (including software prefetch), were added, as were
callback hooks for when certain events occur, such as a cache miss or the arrival of data into the cache.
RSIML was further modified to implement CAPPH exactly as described in Chapter 4. Static
analysis of the (benchmark) programs to create prefetch directives, ultimately meant to be done by a
compiler, was done by hand for this thesis.
5.2 Benchmarks
The Olden benchmarks are a suite of pointer and recursion intensive C programs [24]. The following
benchmarks from the Olden suite have been used to evaluate our method.
Benchmark   Description                                       Data Organization
Em3D        Electromagnetic wave propagation in a 3D object   Single-linked lists
Health      Columbian health care simulation                  Double-linked lists
Mst         Minimum spanning tree of a graph                  Array of single-linked lists
The above benchmarks were chosen in particular because they are pointer-intensive, and
health and mst also have recursive functions which access linked data structures on the way up to
the top node, and are thus well-suited to illustrate and evaluate the performance of CAPPH.
5.3 Simulation Parameters
The simulation parameters were chosen to make the system close to modern-day computer systems.
Following is the specification of the simulated system:
• Processor characteristics: Issue width: 2, ROB size: 4; Functional units: integer ALUs: 4,
floating-point ALUs: 1, address units: 2, and memory units: 4. CPU to L1 Cache Ports: 4.
• Memory: Store buffer size: 4, load-store queue: 4; Store/Load dependence handling: when
an address is unavailable, assume no dependence and re-execute later if necessary. Memory
Interleaving: 16, Bus Width: 16 B; Data cache L1: 32 KiB, line size: 64 B, 4-way set
associative, latency: 1 cycle; L2: 128 KiB, line size: 64 B, 8-way set associative, latency:
16 cycles. Instruction cache L1: 32 KiB, line size: 256 B, 5-way set associative, latency: 10
cycles.
• Prefetch hardware: Prefetch port limit: 1; Max. allowed prefetches in flight: 1000.
• Branch Prediction: Branch history table size: 2^16, history size: 16, Method: YAGS. Jump
target buffer size: 2^16, Global history register-indexed.
Chapter 6
Performance Evaluation
Simulations were performed to determine whether CAPPH can reduce the miss ratio of loads that
access LDS. Three benchmarks em3d, health and mst from the Olden suite were selected. The
prefetch directives were manually written and have been included in the Appendix section.
CAPPH has been compared with the sequential prefetcher in this thesis. However, CAPPH and
sequential prefetch are mutually orthogonal and attempt to prefetch different address sequences.
Hence, detailed information pertaining to individual load statistics has been collected.
This chapter will begin by discussing results from the individual load statistics and then
move on to presenting data on accuracy, coverage and timeliness.
6.1 Individual Load Statistics
6.1.1 Conventional System
The following tables show load instructions from the benchmarks em3d, health and mst with their
average access times and number of L1 cache misses. Numbers used in the PF scheme column
represent no prefetch (0), sequential prefetch (1) and CAPPH (2).
Table 6.1: em3d Troublesome Load Statistics On a Conventional System
            0x11038              0x11030
PF Scheme   Acc. Time  L1 Miss   Acc. Time  L1 Miss
0           75.99      229061    122.28     363906
1           88.94      232392    146.21     370059
2           88.89      229534    145.06     366186
Table 6.2: health Troublesome Load Statistics On a Conventional System
            0x10eb8              0x10a70              0x10ee0
PF Scheme   Acc. Time  L1 Miss   Acc. Time  L1 Miss   Acc. Time  L1 Miss
0           115.05     174445    114.67     17376     60.94      33432
1           57.62      174188    48.77      17418     17.61      34446
2           113.31     174352    114.64     17376     58.64      34436
Table 6.3: mst Troublesome Load Statistics On a Conventional System
            0x1098c              0x1096c              0x109ac              0x10930
PF Scheme   Acc. Time  L1 Miss   Acc. Time  L1 Miss   Acc. Time  L1 Miss   Acc. Time  L1 Miss
0           181.83     236329    176.46     130366    12.27      8183      118.09     124909
1           178.7      236238    172.15     130195    11.47      8144      157.52     128155
2           181.83     236329    176.46     130366    12.27      8183      118.09     124909
Tables 6.1, 6.2 and 6.3 show the results for some of the most troublesome loads executed on a
conventional system using no prefetch, sequential prefetch and CAPPH. The
system specifications are L1 Size: 128KiB, L2 Size: 128, line size: 64B, and memory latency:
100.
Table 6.1 shows loads from em3d. Notice that the number of L1 misses for both
0x11038 and 0x11030 is lower with CAPPH than with sequential prefetch.
Table 6.2 shows results of loads from the health benchmark. The number of cache misses is
almost the same for systems with no prefetch, sequential prefetch and CAPPH. However, the average
access time with sequential prefetch is much lower than it is with no prefetch and with CAPPH. This
means that although the loads miss the cache about the same number of times, a greater fraction of
memory latency was hidden by sequential prefetch. This happens because health exhibits a good
amount of sequential access in its memory reference patterns. CAPPH does not perform as well because
recursive functions are executed during most of the runtime and each instance of the recursive
function has four calls to itself. The need to synchronize the prefetch mechanism with the program
does not allow prefetch to go far ahead of the program, while in sequential prefetch an access to a
cache line triggers a prefetch for the line adjacent to it.
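The size of the effect can be made concrete by computing relative changes from the figures in Table 6.2 for the load at 0x10eb8. The helper below is only arithmetic on those published numbers; its name is introduced here for illustration.

```python
def percent_change(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Load 0x10eb8 in Table 6.2: (average access time, L1 misses) per scheme.
no_prefetch = (115.05, 174445)
sequential = (57.62, 174188)
capph = (113.31, 174352)

seq_time = percent_change(no_prefetch[0], sequential[0])
seq_miss = percent_change(no_prefetch[1], sequential[1])
capph_time = percent_change(no_prefetch[0], capph[0])
```

For this load, sequential prefetch lowers the average access time by about 49.9% while the miss count changes by only about 0.15%, and CAPPH lowers the access time by about 1.5%.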
Performance of CAPPH was better than sequential prefetch for some loads in em3d even
though it has recursive functions, because these functions do not have as many calls to themselves
and do not dominate execution time as the functions in health do.
Table 6.3 shows statistics for loads from mst. Notice that the loads have almost the same
access times and L1 misses with and without prefetch. This is because the L1 hit ratio for mst is
almost 1 and so the benefit from prefetching is not much.
6.1.2 Effect Of Varying Cache Size
Changing the cache size can change the fraction of LDS items that need prefetching. To gauge the
effect of that change on performance the L1 cache size was varied.
Table 6.4 shows that CAPPH performed better than sequential in avoiding cache misses in




























































































































































































































































































































Notice in table 6.5 that CAPPH did not perform as well in health for different cache sizes.
There is almost no benefit from CAPPH in health, as it has to wait for some instructions to commit
before it can proceed with prefetching.
Table 6.5: health Troublesome Load Statistics For Varying L1 Cache Size
                    0x10eb8             0x10a70             0x10ee0
L1 Size  PF Scheme  Acc. Time  L1 Miss  Acc. Time  L1 Miss  Acc. Time  L1 Miss
1        0          115.05     174445   114.67     17376    60.94      33432
1        1          57.62      174188   48.77      17418    17.61      34446
1        2          113.31     174352   114.64     17376    58.64      34436
4        0          114.42     167272   114.28     16959    58.46      27307
4        1          57.23      162552   48.48      16737    13.79      24217
4        2          112.86     167088   114.26     16959    56.16      27721
8        0          113.83     160348   –          –        –          –
8        1          57.18      150068   –          –        –          –
8        2          112.3      160778   113.77     16425    54.61      24090
16       0          113.21     153259   –          –        –          –
16       1          57.53      131474   –          –        –          –
16       2          111.7      153470   113.81     16460    53.13      20332
32       0          112.73     148185   –          –        –          –
32       1          57.51      109696   48.88      11525    10.17      9386
32       2          111.11     147327   113.31     15783    52.17      17619
64       0          112.49     145594   –          –        –          –
64       1          56.91      89268    50.26      9433     9.25       5317
64       2          110.71     143669   112.43     14943    51.26      15950
128      0          111.08     142141   –          –        –          –
128      1          55.34      71907    48.7       7162     9.44       3197
128      2          109.31     139789   –          –        –          –
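As a worked example of reading this table, the fraction of L1 misses removed by each scheme can be computed from the miss counts of the first load at an L1 size of 128. The sketch below assumes that PF Scheme 0 denotes no-prefetch, 1 sequential prefetch and 2 CAPPH, which is what the surrounding discussion suggests; the function and dictionary names are hypothetical.

```python
def miss_reduction(misses_without_prefetch, misses_with_prefetch):
    """Fraction of L1 misses removed relative to the no-prefetch run."""
    if misses_without_prefetch <= 0:
        raise ValueError("baseline miss count must be positive")
    return (misses_without_prefetch - misses_with_prefetch) / misses_without_prefetch


# L1 miss counts for the first load column of Table 6.5 at L1 size 128.
# The scheme labels are an assumed reading of PF Scheme codes 0, 1 and 2.
L1_MISSES_128 = {"none": 142141, "sequential": 71907, "capph": 139789}

seq_reduction = miss_reduction(L1_MISSES_128["none"], L1_MISSES_128["sequential"])
capph_reduction = miss_reduction(L1_MISSES_128["none"], L1_MISSES_128["capph"])
```

Under that reading, sequential prefetch removes roughly half of the misses for this load at the largest cache size, while CAPPH removes under 2 percent of them.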
Table 6.6 shows that mst exhibited similar results in access times and number of L1 and L2

6.1.3 Effect Of Varying Latency
The timeliness of prefetches can be affected by a change in memory latency. The effect may be
more pronounced in CAPPH because it waits for the prefetch-requested data to arrive in the L1
cache before issuing the next directive, so there is always at most one in-flight prefetch. Data
for varying memory latency are in tables 6.7, 6.8 and 6.9.
Notice that the access times of loads increased with latency. Also notice that the increase in
access time was higher in CAPPH than in sequential and no-prefetch, indicating that CAPPH is more
sensitive to an increase in memory latency, as was expected.
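The sensitivity follows from the one-in-flight limit: if every prefetch must return before the next is issued, the number of prefetches that can complete in a given time falls in proportion to the latency. The sketch below is only a back-of-the-envelope bound, assuming that every prefetch goes to memory and takes the full latency; the function name and the `max_in_flight` parameter are hypothetical.

```python
def max_completed_prefetches(window_cycles, memory_latency, max_in_flight=1):
    """Upper bound on prefetches that can complete within a window of cycles.

    Assumes every prefetch misses to memory and takes the full latency, and
    that a new prefetch is issued only when an in-flight slot is free.
    """
    if memory_latency <= 0 or max_in_flight <= 0:
        raise ValueError("latency and in-flight limit must be positive")
    return max_in_flight * (window_cycles // memory_latency)
```

For example, in a window of 1000 cycles at most 20 such prefetches complete at a latency of 50 but only 5 at a latency of 200, unless more than one prefetch is allowed in flight.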
Table 6.7 shows statistics of loads from em3d. Although the average access time with CAPPH is
larger than with no-prefetch, the number of misses in the L1 cache is reduced. Notice that the
access time for 0x11038 with memory latency 150 is approximately the same for CAPPH and
sequential prefetch, and that the load misses more often with sequential prefetch. This may be
due to a higher likelihood of cache pollution with sequential prefetch than with CAPPH.
Table 6.7: em3d Troublesome Load Statistics For Varying Memory Latency
                         0x11038             0x11030
Mem. Latency  PF Scheme  Acc. Time  L1 Miss  Acc. Time  L1 Miss
50            0          –          –        56.84      363868
50            1          –          –        65.86      370047
50            2          –          –        –          –
75            0          –          –        72.49      363868
75            1          51.37      232388   84.71      370046
75            2          –          –        65.98      291671
100           0          54.87      229063   88.5       363910
100           1          63.33      232398   104.17     370056
100           2          65.91      235335   106.95     376256
150           0          75.99      229061   122.28     363906
150           1          88.94      232392   146.21     370059
150           2          88.89      229534   145.06     366186
200           0          97         229068   154.03     363888
200           1          113.62     232392   184.92     370001
200           2          108.77     227781   174.78     362932
Table 6.8 has data similar to the previous results for the health benchmark: health performs
better with sequential prefetch, and the miss rates of loads are higher in CAPPH than in sequential
prefetch.
Table 6.8: health Troublesome Load Statistics For Varying Memory Latency
                         0x10be8             0x10a70             0x10ee0
Mem. Latency  PF Scheme  Acc. Time  L1 Miss  Acc. Time  L1 Miss  Acc. Time  L1 Miss
50            0          70.66      148185   –          –        –          –
50            1          37.19      109736   32.43      11533    7.7        9357
50            2          69.57      147028   70.79      15781    32.79      17483
75            0          91.69      148185   –          –        –          –
75            1          47.2       109720   40.62      11529    8.92       9371
75            2          90.35      147457   92.05      15770    42.5       17719
100           0          112.73     148185   –          –        –          –
100           1          57.51      109696   48.88      11525    10.17      9386
100           2          111.11     147327   113.31     15783    52.17      17619
150           0          154.84     148184   –          –        –          –
150           1          79.48      109669   65.46      11521    12.97      9482
150           2          152.52     147177   155.16     15222    71.36      17504
200           0          197.02     148184   –          –        –          –
200           1          102.52     109614   82.15      11519    16.02      9587
200           2          194.18     147353   197.64     15222    90.53      17577

Figure 6.1: Variation of Accuracy with Memory Latency
6.2 Accuracy
Figure 6.1 shows that CAPPH out-performs sequential prefetch more often than the other way
around. The graph shows that accuracy does not change with latency for health and mst but
decreases considerably for em3d. This is because CAPPH issues fewer prefetches in health and mst
than in em3d. In mst, prefetches are issued mostly for cached data because mst by itself has a high
L1 hit rate. In health, demand fetches are issued before CAPPH can prefetch because CAPPH spends
most of its time waiting for a particular instruction to commit. Prefetches are issued for em3d, but
CAPPH is sensitive to memory latency and its accuracy decreases as latency increases.
Accuracy varies little as the cache size increases. The difference between the accuracies of
CAPPH and sequential prefetch grows with cache size, however, because sequential prefetch, being
a purely hardware scheme, issues prefetches for successive cache lines. CAPPH instead may
currently issue prefetches for several data items within the same line before moving on to the
next line. So CAPPH takes longer than sequential prefetch to move through the cached lines before
issuing the next prefetch. This delay increases with cache size because more data is stored in a
larger cache.
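The same-line effect can be made concrete with a small sketch that counts how many prefetch addresses in a sequence fall in the same cache line as the address before them. The function names, addresses and line sizes are hypothetical and only illustrate the mapping from addresses to lines; they are not taken from the simulations.

```python
def line_of(addr, line_size):
    """Cache line number that holds a byte address."""
    return addr // line_size


def same_line_prefetches(addresses, line_size):
    """Count prefetch addresses that land in the same line as the previous one."""
    count = 0
    previous_line = None
    for addr in addresses:
        line = line_of(addr, line_size)
        if line == previous_line:
            count += 1
        previous_line = line
    return count


# Three fields of one node followed by the first field of a nearby node
# (addresses are made up for illustration).
example_addresses = [0x1000, 0x1004, 0x1008, 0x1040]
```

With 32-byte lines, two of the four example prefetches repeat the line just requested; with 128-byte lines, three of them do.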
Figure 6.2: Variation of Accuracy with L1 Cache Size
Figure 6.3: Variation of Accuracy with Cache Line Size
Figure 6.4: Variation of Coverage with Memory Latency
The same effect as in figure 6.2 is observed in figure 6.3. The accuracy falls more steeply
in health than in em3d because health exhibits higher locality. On the other hand, accuracy
remains high and constant for mst because prefetches for mst mostly target cached data. Because
CAPPH does not have to wait long for a memory fetch to finish before issuing another, it achieves
high accuracy. Although this does not provide a performance benefit to mst, it suggests that more
can be achieved with better synchronization.
6.3 Coverage
Figure 6.4 shows that coverage decreases as memory latency increases and that coverage is
almost negligible for health and mst, for the reasons already mentioned.
Figure 6.5 shows the variation in coverage with L1 cache size and is similar to figure 6.4.
Figure 6.5: Variation of Coverage with L1 Cache Size
Figure 6.6: Variation of Coverage with Cache Line Size
6.4 Timeliness
The percentage of timely prefetches issued has been computed both with and without late
prefetches. The percentage of timely prefetches without late prefetches is calculated as the ratio
of the number of useful prefetches to the number of actual prefetches. The percentage of timely
prefetches including late prefetches is calculated as the ratio of the sum of useful and late
prefetches to the number of actual prefetches. The number of actual prefetches is computed as
the difference between the total number of prefetches issued and the number of cached prefetches.
The percentage of timely prefetches is higher in CAPPH because it is not as aggressive as
sequential prefetch, which issues many unnecessary prefetches.
Figure 6.7: Variation of Timeliness with Memory Latency
Figure 6.7 shows that the percentage of timely prefetches decreases as latency increases.
This is because fetching data from memory takes longer, while the lead time between issuing a
prefetch and the load that needs the data does not increase.
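The two timeliness percentages defined in this section can be written as a short function. This is a sketch of the stated ratios only; the function and parameter names are hypothetical, and the counts in the usage note are made up rather than taken from the simulations.

```python
def timeliness(total_issued, cached, useful, late):
    """Return (timely % excluding late, timely % including late prefetches).

    actual prefetches = total prefetches issued - cached prefetches
    """
    actual = total_issued - cached
    if actual <= 0:
        raise ValueError("no actual prefetches to evaluate")
    without_late = 100.0 * useful / actual
    with_late = 100.0 * (useful + late) / actual
    return without_late, with_late
```

For instance, with 1000 prefetches issued, 200 of them cached, 400 useful and 100 late, the function returns 50.0 and 62.5 percent.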
Chapter 7
Conclusions
A new hardware/software prefetch scheme has been described in this thesis. Because the required
information is collected by the compiler and used by the hardware at run time, CAPPH has the
advantages of both hardware and software prefetch techniques and is capable of eliminating
their drawbacks. It can also prefetch for recursive functions that access LDS, which are otherwise
difficult to prefetch, with promising results.
CAPPH is similar to thread-based prefetch, but here the compiler writes the instructions
(prefetch directives) to be executed in the thread, and the thread is executed not by the processor
but by separate prefetch hardware. Although the prefetch directives are not executed by the
processor, their execution has to be synchronized with the program for prefetches to be timely.
This synchronization is currently done using the program counter of instructions that commit.
The commit stream has been used for synchronization because it is the easiest to use and because
some prefetch directives need to store register values of an instruction when it commits.
Therefore, it is necessary for CAPPH to be on-chip rather than in any other layer of the memory
system.
The drawbacks of using the commit instruction stream to trigger CAPPH prefetches are, first,
that prefetches are not issued as early as possible, which delays CAPPH prefetches, and second,
that prefetching can be unnecessarily slowed by a region of code with infrequent commits. For
example, no prefetches are issued while instructions are being squashed after a branch
misprediction because none of those instructions commit, so the time between the mispredicted
branch and the commit of the first instruction after it is wasted.
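Commit-stream synchronization can be sketched as a small engine that advances through a directive list only when instructions commit. This is an illustrative model, not the RSIML implementation: the class and field names are hypothetical, and treating a trigger PC of 0 as "no trigger needed" is an assumption made for this sketch. It issues at most one directive per committed instruction, as the current implementation does.

```python
from dataclasses import dataclass


@dataclass
class Directive:
    trigger_pc: int   # 0 is treated here as "needs no trigger" (an assumption)
    name: str


class CommitSyncEngine:
    """Issues prefetch directives in order, driven by committed PCs."""

    def __init__(self, directives):
        self.directives = directives
        self.next_index = 0
        self.issued = []

    def on_commit(self, pc):
        """Called once per committed instruction; issues at most one directive."""
        if self.next_index >= len(self.directives):
            return None
        directive = self.directives[self.next_index]
        if directive.trigger_pc in (0, pc):
            self.next_index += 1
            self.issued.append(directive.name)
            return directive.name
        return None   # wait: the triggering instruction has not committed yet
```

Because `on_commit` is never called for squashed instructions, the engine makes no progress between a mispredicted branch and the next commit, which is the second drawback described above.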
Another source to synchronize with is the address reference stream. It would allow CAPPH
to prefetch data a larger distance ahead and provide better results. Such a mechanism might store
the prefetched addresses in a queue and use hints inserted into load instructions to keep track of
which address in the queue is currently being accessed by the program. The distance between the
address being prefetched and the address being accessed by the program can be determined by
the hardware depending on parameters like the cache size. This is feasible because CAPPH can
achieve good accuracy.
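One way such a queue might behave is sketched below. This is a hypothetical design for illustration only, since the thesis proposes the mechanism without specifying it: the class name, the `max_distance` parameter and the rule that a hinted load retires all older queue entries are assumptions.

```python
from collections import deque


class PrefetchAddressQueue:
    """Tracks prefetched addresses that the program has not yet reached."""

    def __init__(self, max_distance):
        self.max_distance = max_distance   # e.g. chosen from the cache size
        self.queue = deque()

    def distance(self):
        """How far prefetching currently runs ahead of the program."""
        return len(self.queue)

    def can_prefetch(self):
        return self.distance() < self.max_distance

    def record_prefetch(self, addr):
        if not self.can_prefetch():
            raise RuntimeError("prefetch distance limit reached")
        self.queue.append(addr)

    def on_hinted_load(self, addr):
        """A hinted program load reached addr: retire it and all older entries."""
        if addr in self.queue:
            while self.queue.popleft() != addr:
                pass
```

The prefetch hardware would stop issuing when `can_prefetch()` is false and resume once hinted loads shorten the queue, so the distance ahead is bounded by the chosen limit rather than by the commit stream.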
Currently, at most one prefetch directive is issued for every instruction that commits, so
the full potential of the memory system has not been exploited. This is because the code
does not currently support verification of dependencies between successive directives, which would
otherwise allow prefetches to be issued faster and thereby improve performance.
Future studies can consider complete-time and perhaps decode-time synchronization. Better
compiler techniques to provide information about data flow in the program could also be explored.
Bibliography
[1] Anacker W. and Wang C. P., “Performance Evaluation of COmputing Systems with Mem-
ory Hierarchies,”IEEE Transactions on Computers, Vol. 16, No. 6, p. 764-773, Dec. 1967.
[2] Al-Sukhni H., Bratt I. and Connors D. A., “Compiler-Directed Content-Aware Prefetching
for Dynamic Data Structures,”In Proceedings of the 12th International Conference on
Parallel Architectures and Compilation Techniques, p. 91-100, Oct 2003.
[3] Annavaram M., Patel J. M. and Davidson E. S., “Data Prefetching by Dependence Graph
Precomputation,”In Proceedings of the 28th Annual International Symposium on Com-
puter Architecture, p.52-61, Jul 2001.
[4] August D. I., Connors D. A., Mahlke S. A., Sias J. W., Crozier K. M., Cheng B-C., Eaton
P.R., Olaniran Q. B. and Hwu W. W, “Integrated Predicated and Soeculative Execution in
the IMPACT EPIC Architecture,”Proceedings of the 25th annual International Symposium
on Computer Architecture, p. 227-237, Jun 1998.
[5] Chen T-F. and Baer J. L., “An effective on-chip preloading scheme to reduce data access
penalty,”Supercomputing, p. 176-186, 1991. Also TR 91-03-07, Department of Computer
Science and Engineering, University of Washington.
[6] Chen T-F. and Baer J. L., “Reducing Memory Latency via Non-blocking and Prefetching
Caches,” Technical Report, University of Washington, Jun 1992.
[7] Collins J., Sair S., Calder B. and Tullsen D. M., “Pointer Cache Assisted Prefetching,”
Proceedings of the 35th Annual International Symposium on Microarchitecture, p. 62-73,
Nov 2002.
[8] Craig B. Zilles, “Benchmark Health Considered Harmful,”ACM SIGARCH Computer Ar-
chitecture News, Vol. 29, p. 4-5, 2001.
[9] Joseph D. and Grunwald D., “Prefetching Using Markov Predictors,”Proceedings of the
24th annual International Symposium on Computer Architecture, p. 252-263, 1997.
[10] Karlson M., Dahlgren F. and Stenstrom P., “A Prefetching Technique for Irregular Ac-
cesses to Linked Data Structures,”Proceedings of the 6th International Conference on
High Performance Computer Architecture, p. 206-217, 2000.
62
63
[11] D. Kroft, “Lockup-free instruction fetch/prefetch cache organization,”I Proceedings of
the 8th Annual International Symposium on Computer Architecture, p. 81-87, 1981.
[12] Lipasti M. H., Schmidt W. J., Kunkel S. R. and Roediger R. R, “SPAID: Software Prefetch-
ing in Pointer- and Call-Intensive Environments,”Proceedings of the 28th annual Interna-
tional Symposium on Microarchitecture, p. 231-236, Nov 1995.
[13] Luk C-K. and Mowry T. C., “Compiler-Based Prefetching for Recursive Data Structures,”
Proceedings of the 7th International Conference on Architectural Support for Program-
ming Languages and Operating Systems, p. 222-233. 1996.
[14] Porterfield A. K., “Software Methods for Improvement of Cache Performance on Super-
computer Applications,”, Ph.D Thesis, Rice University, May 1989.
[15] Smith A. J., “Sequential Program Prefetching in Memory Hierarchies,”IEEE Computer,
Vol 11, No. 12, p. 7-21, Dec 1978.
[16] Smith A. J., “Cache Memories,”Computing Surveys, Vol 14, No. 3, p. 473-530, Sep 1982.
[17] VanderWiel S. P. and Lilja D. J., “Data Prefetch Mechanisms,”ACM Computing Surveys,
Jun 2000.
[18] Vijay S. P., Parthasarathy R. and Sarita V. A. “RSIM: An Execution-Driven Simulator for
ILP-Based Shared-Memory Multiprocessors and Uniprocessors,”P oceedings of the Third
Workshop on Computer Architecture Education, Feb 1997
[19] Zilles C. and Sohi G., “Execution-based Prediction Using Speculative Slices,”Proceedings
of the 28th Annual International Symposium on Computer Architecture, p. 2-13, Jul 2001.
[20] http://www.ece.lsu.edu/ee4720/
[21] http://www.ece.lsu.edu/bugzilla/showbug.cgi?id=52
Date visited: Oct 19, 2005.
[22] http://www.ece.lsu.edu/bugzilla/showbug.cgi?id=134
Date visited: Oct 19, 2005.
[23] http://skywalker.dif.um.es/ rfernandez/papers/porting-rsim.ps
Date visited: Oct 19, 2005.
[24] http://www-ali.cs.umass.edu/DaCapo/benchmarks.html
Date visited: Nov 4, 2005.
Appendix: Input Data for CAPPH
Table 7.1: Prefetch Directives for mst.
Function Index PC Action Offset1 Offset2 Offset3 Offset4
0. main 0 10078 d 9 1 0 0
1 1007c r 32 0 358706 0
2 10080 a 5 -1 0 0
call dealwithargs 3 68568 f 1 0 0 0
4 10be0 c 8 0 2 2
store size 5 68580 s 2 1 0 0
6 10be4 c 5 2 5 1
7 10bf4 c 35 1 2 0
8 10bf8 c 31 2 2 -1
call ComputeMst 9 10bfc f 7 0 0 0
10 10c18 r 33 3 0 0
11 0 r 32 33 4 0
12 0 c 4 32 2 0
13 0 m 17 31 1 0
14 0 m 11 35 2 8
15 10ce0 c 35 35 5 2
16 10ce4 c 63 63 2 35
call Do all BlueRule 17 10ce8 f 6 0 0 0
18 10cf0 o 33 16 35 18
19 10cf4 c 63 1 5 2
20 10cf8 f 6 0 0 0
21 10d08 o 63 30 0 0
22 10d0c r 62 63 -40 0
23 10d10 r 62 63 -48 0
24 10d30 a 4 4 0 0
25 0 m 2 33 5 4
26 0 r 4 4 4 0
27 0 f 5 0 0 0
28 10cb4 c 31 31 2 -1
29 10cb8 m -15 31 5 0
30 0 e 2 0 0 0
1. dealwithargs 31 107e8 o 1 1 -1 -1
32 107ec s 9 4 0 0
33 107f0 0 0 0 0 0
34 10814 o 1 8 -1 -1
35 10818 s 9 4 0 0
36 1081c e 1 0 0 0
2. HashLookup 37 0 r 45 7 0 0
38 0 d 43 0 4 0
39 0 c 44 42 4 3
40 0 c 44 44 12 45
41 0 r 45 43 0 0
42 0 c 44 44 3 2
43 0 i 46 45 44 0
44 0 m 5 46 1 0
45 0 r 44 46 0 0
46 0 m 3 44 1 42
47 0 r 46 46 8 0
48 0 m -3 46 5 0
49 109d8 e 1 0 0 0
3. HashInsert 50 0 r 45 7 0 0
51 0 d 43 0 4 0
52 0 c 44 42 4 3
53 0 c 44 44 12 45
54 0 r 45 43 0 0
55 0 c 44 44 3 2
56 0 i 46 45 44 0
57 0 m 5 46 1 0
58 0 r 44 46 0 0
59 0 m 3 44 1 42
60 0 r 46 46 8 0
61 0 m -3 46 5 0
62 10b2c e 1 0 0 0
4. HashDelete 63 0 r 45 7 0 0
64 0 d 43 0 4 0
65 0 c 44 42 4 3
66 0 c 44 44 12 45
67 0 r 45 43 0 0
68 0 c 44 44 3 2
69 0 i 46 45 44 0
70 0 m 5 46 1 0
71 0 r 44 46 0 0
72 0 m 3 44 1 42
73 0 r 46 46 8 0
74 0 m -3 46 5 0
75 10bc0 e 1 0 0 0
5. BlueRule 76 0 m 16 4 1 0
77 0 r 42 4 0 0
78 0 r 43 4 8 0
79 0 c 42 33 2 0
80 0 f 2 0 43 42
81 0 r 41 4 4 0
82 0 r 61 4 8 0
83 0 m 9 41 1 0
84 0 m 3 41 5 33
85 0 r 41 41 4 0
86 0 m -2 41 5 0
87 0 r 43 41 8 0
88 0 f 2 0 0 0
89 0 r 41 41 4 0
90 0 m -7 41 5 0
91 0 0 0 0 0 0
92 69236 e 1 0 0 0
6.Do all BlueReturn 93 0 m 9 35 2 8
94 10ee0 c 35 35 5 2
95 10ee4 c 63 63 2 35
96 10ee8 f 6 0 0 0
97 10ef0 o 35 16 63 10
98 0 f 6 0 0 0
99 10f0c z 3 10f10 0 0
100 0 o 62 30 0 0
101 0 r 62 62 -28 0
102 0 m 2 33 5 4
103 0 r 4 4 4 0
104 0 f 5 0 0 0
105 0 0 0 0 0 0
106 10f44 e 1 0 0 0
7. MakeGraph 107 10f5c o 3 8 0 0
108 10fe4 c 48 1 2 -1
109 10fe8 s 7 4 1 0
110 10ff8 s 6 1 1 0
111 11018 m 5 48 2 0
112 0 c 59 48 6 5
113 0 f 8 0 0 0
114 0 c 48 48 2 -1
115 0 m -3 48 4 0
116 1108c e 1 0 0 0
8. AddEdges 117 0 r 52 3 0 0
118 110c4 d 7 0 0 0
119 110c8 c 56 0 2 0
120 110cc c 53 48 3 2
121 110d4 s 58 1 0 0
122 0 i 55 3 53 0
123 0 m 13 55 1 0
124 0 i 57 58 0 0
125 0 c 51 0 2 0
126 0 m 6 51 1 59
127 0 r 43 55 8 0
128 0 c 52 51 12 5
129 0 c 52 52 3 2
130 0 c 42 57 7 52
131 0 f 3 0 0 0
132 0 c 51 51 2 1
133 70160 m -6 1 11 0
134 0 r 55 55 4 0
135 0 c 59 59 2 1
136 0 m -9 55 5 0
137 0 0 0 0 0 0
138 70184 e 2 0 0 0
Table 7.2: Prefetch Directives for em3d.
Function Index PC Action Offset1 Offset2 Offset3 Offset4
0. main 0 10078 d 9 1 0 0
1 1007C r 32 0 361675 0
2 10080 a 5 -1 0 0
call dealwithargs 3 1099C f 1 0 0 0
call initialize graph 4 10A78 f 2 0 0 0
5 0 m 9 13 1 0
6 0 c 20 0 2 0
7 0 c 19 20 3 2
8 0 i 32 9 19 0
9 0 t 32 4 0 20
10 0 i 34 10 19 4
11 0 t 34 4 0 20
12 0 c 20 20 2 1
13 0 m -6 20 2 1
14 0 i 36 9 0 0
call computenode 15 10B28 f 3 0 0 0
16 0 i 36 10 0 0
call computenode 17 10B30 f 3 0 0 0
18 10AA8 e 1 0 0 0
1. dealwithargs 19 10874 o 1 8 0 0
20 1087C s 14 4 0 0
21 107E4 s 1 1 0 0
22 107EC s 14 4 0 0
23 10860 o 2 8 0 0
24 10868 s 15 4 0 0
25 107F8 s 2 1 0 0
26 10800 s 15 4 0 0
27 1084C o 3 8 0 0
28 10854 s 16 4 0 0
29 1080C s 3 1 0 0
30 10814 s 16 4 0 0
31 10838 o 4 8 0 0
32 10840 s 17 4 0 0
33 10824 s 4 1 0 0
34 1082C s 17 4 0 0
35 10880 e 0 0 0 0
2. initializegraph 36 110D4 s 5 1 0 0
37 110D8 c 6 5 2 0
38 110DC c 7 5 2 4
39 110E4 s 8 1 0 0
40 110E8 c 9 8 2 0
41 110EC c 10 8 2 4
42 110F0 s 16 1 0 0
43 110F4 c 40 16 3 2
call maketables 44 110F8 f 4 0 0 0
call makeall neighbors 45 110FC f 5 0 0 0
call updateall from-coeffs 46 11114 f 6 0 0 0
call fill all from fields 47 1112C f 7 0 0 0
call localize 48 11154 f 8 0 0 0
49 0 a 5 -1 0 0
50 1117C s 17 1 0 0
51 0 c 32 17 3 2
52 0 c 33 1 3 2
53 0 c 19 0 0 0
54 0 c 20 0 2 0
55 0 i 36 6 20 0
56 0 i 37 9 19 0
57 0 i 35 7 20 0
58 0 i 37 10 19 4
59 0 i 36 6 20 0
60 0 i 36 7 20 4
61 0 i 36 6 20 0
62 0 i 35 7 20 4
63 0 m -4 20 2 32
64 0 m -10 19 2 33
65 11234 e 1 0 0 0
3. computenodes 66 0 t 36 4 0 -1
67 0 0 3 0 0 0
68 10988 e 0 0 0 0
4. maketables 69 113F4 s 12 1 0 0
70 11450 s 11 1 0 0
71 11530 e 0 0 0 0
5. makeall neighbors 72 0 c 41 12 7 40
73 0 d 16 0 0 0
74 0 d 15 0 0 0
75 0 c 42 6 2 0
call makeneighbors 76 11568 f 9 0 0 0
77 0 c 41 11 7 40
78 0 d 16 0 0 0
79 0 d 15 0 0 0
80 115AC c 42 7 2 0
81 115B0 a 5 1 0 0
call makeneighbors 82 0 f 9 0 0 0
83 0 a 5 -1 0 0
84 0 e 1 0 0 0
6. updateall from coeffs 85 0 i 44 7 40 0
86 0 r 45 44 0 0
87 0 d 45 0 0 0
88 11610 r 45 45 4 0
89 0 d 45 0 0 0
90 0 m -2 45 5 0
91 0 i 44 6 40 0
92 0 r 45 44 0 0
93 0 d 45 0 0 0
94 1166C r 45 45 4 0
95 0 d 45 0 0 0
96 0 m -2 45 5 0
97 0 e 100 0 0 0
7. fill all from fields 98 0 c 56 3 3 2
99 0 i 46 7 40 0
100 0 i 47 46 0 0
call fill from fields 101 0 f 10 0 0 0
102 0 i 46 6 40 0
103 0 i 47 46 0 0
104 0 a 5 1 0 0
call fill from fields 105 0 f 10 0 0 0
106 0 a 5 -1 0 0
107 0 e 1 0 0 0
8. localize 108 0 a 5 -1 0 0
109 0 i 54 7 40 0
110 0 i 55 54 0 0
111 0 t 55 4 -1 -1
112 11744 i 54 6 40 0
113 0 i 55 54 0 0
114 0 t 55 4 -1 -1
115 11768 e 1 0 0 0
9. makeneighbors 116 10EAC o 34 8 0 0
117 0 c 20 0 2 0
118 0 r 32 0 259344 0
119 10ED0 s 32 1 0 0
120 0 c 32 32 3 2
121 0 r 33 5 0 0
122 0 i 32 33 32 0
123 0 c 21 0 2 0
124 0 c 22 0 2 0
125 0 i 35 34 22 4
126 0 c 21 21 2 1
127 0 m -2 21 2 20
128 0 c 20 20 2 1
129 0 m -11 20 2 3
130 0 r 41 41 4 0
131 0 d 41 0 0 0
132 0 m -16 41 5 0
133 10FE0 e 2 0 0 0
10. fill from fields 134 0 d 47 0 0 0
135 0 c 21 0 2 0
136 0 c 20 0 2 0
137 0 r 48 47 8 0
138 0 i 55 48 20 4
139 0 c 21 21 2 1
140 0 r 49 47 0 0
141 0 r 50 55 20 0
142 0 c 50 50 2 1
143 0 r 51 55 12 0
144 0 r 52 55 20 0
145 0 c 60 50 3 2
146 0 i 53 51 60 0
147 0 r 54 55 16 0
148 0 c 60 50 3 3
149 0 i 54 54 60 0
150 1105C m -13 21 2 3
151 0 r 47 47 4 0
152 0 m -18 47 5 0
153 0 0 0 0 0 0
154 110BC e 1 0 0 0
Table 7.3: Prefetch Directives for health.
Function Index PC Action Offset1 Offset2 Offset3 Offset4
0. main 0 10078 d 9 1 0 0
1 1007C r 32 0 378110 0
2 10080 a 5 -1 0 0
call dealwithargs 3 10C44 f 1 0 0 0
call alloc tree 4 10C60 f 2 0 0 0
5 10CA4 c 33 4 2 0
call sim(top) 6 10CD0 f 3 0 0 0
7 10CFC c 33 4 2 0
results=getresults(top) 8 10D00 f 4 0 0 0
9 10D10 e 0 0 0 0
1. dealwithargs 10 107F4 s 1 2 0 0
11 10800 s 2 2 0 0
12 10818 s 3 2 0 0
13 0 e 0 0 0 0
2. alloc tree 14 10854 a 3 0 0 0
15 10860 s 4 1 1 0
call alloc tree 16 108B8 f 2 0 0 0
call alloc tree 17 108EC f 2 0 0 0
18 108F0 z 4 67816 0 0
19 0 o 32 24 0 0
20 0 d 32 0 0 0
21 0 a 2 -3 0 0
22 1098C e 6 0 0 0
3. sim 23 10DC4 m 53 1 11 0
24 10DDC f 3 0 0 0
25 10DF4 f 3 0 0 0
26 10DFC o 21 30 33 24
27 10E00 d 33 0 0 0
28 10E04 c 34 33 2 32
29 10E08 c 19 0 2 -32
30 10E0C c 35 0 2 -20
31 0 i 36 21 35 0
32 0 m 16 36 1 0
33 0 r 36 36 0 0
34 0 m 14 36 1 0
35 0 r 63 36 4 0
36 0 r 38 34 4 0
37 0 d 63 0 0 0
38 0 m 3 38 3 0
39 0 c 62 34 2 24
40 0 a 4 2 0 0
41 0 c 62 34 2 12
call addList 42 0 f 7 0 0 0
43 0 d 63 0 0 0
44 0 r 36 36 0 0
45 0 r 62 21 35 0
call removeList 46 0 f 8 0 0 0
47 0 m -12 36 5 0
48 0 c 35 35 2 -4
49 0 m -18 35 4 19
50 0 r 38 33 2 68
51 0 m 11 38 1 0
52 0 r 63 38 4 0
53 0 r 39 63 8 0
54 0 m 6 39 5 0
55 0 r 38 34 4 0
56 0 c 62 34 2 36
call removeList 57 0 f 8 0 0 0
58 0 c 62 33 2 20
call addList 59 0 f 7 0 0 0
60 0 r 38 38 0 0
61 0 m -9 38 5 0
62 0 r 38 33 2 56
call checkpatientsassess 63 0 f 5 0 0 0
64 0 r 38 33 2 44
call checkpatientswaiting 65 0 f 6 0 0 0
66 10F34 m 10 0 11 0
67 10FB0 s 63 4 0 0
68 0 r 38 34 4 0
69 0 d 63 0 0 0
70 0 m 3 38 3 0
71 0 c 62 34 2 24
72 0 a 4 2 0 0
73 0 c 62 34 2 12
74 0 f 7 0 0 0
75 0 d 63 0 0 0
76 11000 e 1 0 0 0
4. getresults 77 109B8 0 1 1 0 0
78 109CC f 4 0 0 0
79 109F4 f 4 0 0 0
80 0 o 21 30 33 24
81 0 c 22 0 2 3
82 0 r 18 21 -70 0
83 0 i 18 21 -100 -12
84 0 c 22 22 2 -1
85 0 m -3 22 2 0
86 0 r 33 33 20 0
87 10A50 0 1 1 0 0
88 0 r 39 33 4 0
89 0 r 40 39 4 0
90 0 r 40 39 0 0
91 0 r 33 33 0 0
92 0 a 4 -4 0 0
93 10A90 0 1 0 0 0
94 0 e 1 0 0 0
5. checkpatientsassess 95 0 m 11 38 1 0
96 0 r 42 38 4 0
97 0 c 62 33 2 56
call removeList 98 0 f 8 0 0 0
99 10B58 c 62 33 2 68
call addList 100 0 f 7 0 0 0
101 0 a 4 3 0 0
102 10B9C c 62 33 2 80
call addList 103 0 f 7 0 0 0
104 0 r 38 38 0 0
105 0 m -9 38 1 0
106 0 0 0 0 0 0
107 10BB4 e 1 0 0 0
6. checkpatientswaiting 108 0 m 10 38 1 0
109 0 r 56 33 36 0
110 0 r 63 38 4 0
111 0 m 5 0 3 56
112 0 c 62 33 2 44
call removeList 113 0 f 8 0 0 0
114 0 c 62 33 2 56
call addList 115 0 f 7 0 0 0
116 0 r 38 38 0 0
117 0 m -7 38 1 0
118 0 0 0 0 0 0
119 10C30 e 1 0 0 0
7. addList 120 0 t 62 0 -1 -1
121 0 0 3 0 0 0
122 11164 a 6 0 0 0
123 1116C e 2 0 0 0
8. removeList 124 0 t 62 0 4 -1
125 0 0 3 0 0 0
126 111C8 e 2 0 0 0
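The directive rows in Tables 7.1 to 7.3 share a fixed layout: an optional label, then the index, PC, action and four offsets. The sketch below shows one way such rows could be read into records for further processing. The names `DirectiveRow` and `parse_row` are hypothetical, and the fields other than the index are kept as strings because their encoding (for example, hexadecimal versus decimal PCs) is not assumed here.

```python
from dataclasses import dataclass


@dataclass
class DirectiveRow:
    label: str       # optional function name or annotation, "" if absent
    index: int
    pc: str          # kept as text; the number base is not assumed
    action: str      # single-character action code as printed in the tables
    offsets: tuple   # Offset1..Offset4, kept as text


def parse_row(line):
    """Split one table row; the last seven tokens are the fixed columns."""
    tokens = line.split()
    if len(tokens) < 7:
        raise ValueError("row has fewer than seven columns: %r" % line)
    label = " ".join(tokens[:-7])
    index, pc, action = tokens[-7], tokens[-6], tokens[-5]
    return DirectiveRow(label, int(index), pc, action, tuple(tokens[-4:]))
```

For example, `parse_row("call dealwithargs 3 68568 f 1 0 0 0")` yields a record with label `call dealwithargs`, index 3 and action `f`.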
Vita
Sheela Doshi was born in Visakhapatnam, India. She completed her schooling at Timpany
School, Visakhapatnam.
She did her bachelor’s at Maharaja Vijayaram Gajapathi Raj College of Engineering,
Vizianagaram (affiliated to Jawaharlal Nehru Technological University), majoring in computer
science and engineering. She graduated with distinction in 2002.
She then joined the Department of Computer Science at Louisiana State University, Baton
Rouge, in Fall 2002. Later she moved to the Department of Electrical and Computer Engineering,
Louisiana State University, Baton Rouge, to pursue her master’s with a major in computer
engineering, in the Fall of 2003. She will be graduating in May 2006 with a master’s degree in
electrical engineering.
