



(B.Eng, TONGJI UNIVERSITY SHANGHAI, CHINA)
A THESIS SUBMITTED FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2010
Acknowledgements
First of all, I would like to express my deepest gratitude to my Ph.D. advisor, Professor
Tulika Mitra, for her constant guidance and encouragement during my five years
of graduate study. Her persistent guidance helped me stay on track in my research.
Without her help this dissertation would not have been possible.
I am grateful to my dissertation committee members, Professors Wong Weng Fai,
Teo Yong Meng and Sri Parameswaran for their time and thoughtful comments. Thanks
are also due to Professors Abhik Roychoudhury and Samarjit Chakraborty. It is an
honor for me to work with them throughout my graduate study. I have greatly benefitted
from the discussions I have had with them.
I would like to thank the National University of Singapore for funding me with a research
scholarship and offering me teaching opportunities to support my last year of
study. My thanks also go to the administrative staff in the School of Computing, National
University of Singapore, for their support during my study.
I would like to thank my friends in NUS for assisting and helping me in my research:
Ju Lei, Ge Zhiguo, Huynh Phung Huynh, Unmesh D. Bordoloi, Joon Edward Sim,
Ankit Goel, Ramkumar Jayaseelan, Vivy Suhendra, Pan Yu, Li Xianfeng, Liu Haibin,
Liu Shanshan, Kathy Nguyen Dang, Andrei Hagiescu and David Lo. My graduate life
at NUS would not have been interesting and fun without them.
I would like to extend heartfelt gratitude to my parents for their never ending love
and faith in me and encouraging me to pursue my dreams. They are a great source of
encouragement during my graduate study especially when I found it difficult to carry
on. Thank you for always being there.
Finally, this dissertation would not have been possible without the support of my
wife Chen Dan. She sacrificed a great deal ever since I started my graduate study, but
she was never one to complain. The hardest part has been the last year, when I was
doing teaching assistantship and she was looking for jobs. In spite of all the difficulties,





List of Publications x
List of Tables xi
List of Figures xii
1 Introduction 1
1.1 Embedded System Design . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Memory Optimization for Embedded System . . . . . . . . . . . . . . 3
1.3 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6




2.1 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Cache Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Literature Review 14
3.1 Application Specific Memory Optimization . . . . . . . . . . . . . . . 14
3.2 Design Space Exploration of Caches . . . . . . . . . . . . . . . . . . . 15
3.2.1 Trace Driven Simulation . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Analytical Modeling . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Cache Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Hard Real-time Systems . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 General Embedded Systems . . . . . . . . . . . . . . . . . . . 21
3.4 Code Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Cache Modeling for Timing Analysis . . . . . . . . . . . . . . . . . . 25
4 Cache Modeling via Static Program Analysis 27
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Analysis Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Cache Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.1 Concrete Cache States . . . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Probabilistic Cache States . . . . . . . . . . . . . . . . . . . . 32
4.4 Static Cache Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.1 Analysis of DAG . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.2 Analysis of Loop . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.3 Special case for Direct Mapped Cache . . . . . . . . . . . . . . 39
4.4.4 Analysis of Whole Program . . . . . . . . . . . . . . . . . . . 41
4.5 Cache Hierarchy Analysis . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6.1 Level-1 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6.2 Multi-level Caches . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 Design Space Exploration of Caches 57
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 General Binomial Tree (GBT) . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Probabilistic GBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Concatenation of Probabilistic GBTs . . . . . . . . . . . . . . 64
5.3.2 Combining GBTs in a Probabilistic GBT . . . . . . . . . . . . 66
5.3.3 Bounding the size of Probabilistic GBT . . . . . . . . . . . . . 68
5.3.4 Cache Hit Rate of a Memory Block . . . . . . . . . . . . . . . 70
5.4 Static Cache Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6 Instruction Cache Locking 76
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2 Cache Locking Problem . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Cache Locking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.1 Optimal Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.2 Heuristic Approach . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Procedure Placement 111
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 Procedure Placement Problem . . . . . . . . . . . . . . . . . . . . . . 114
7.3 Intermediate Blocks Profile . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Procedure Placement Algorithm . . . . . . . . . . . . . . . . . . . . . 120
7.5 Neutral Procedure Placement . . . . . . . . . . . . . . . . . . . . . . . 123
7.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.6.1 Layout for a Specific Cache Configuration . . . . . . . . . . . . 129
7.6.2 Neutral Layout . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8 Putting it All Together 141
8.1 Integrated Optimization Flow . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 142
9 Conclusion 144
9.1 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Bibliography 146
Abstract
The application-specific nature of embedded systems creates the opportunity to design
a customized system-on-chip (SoC) platform for a particular application or application
domain. The cache memory subsystem matters greatly because it bridges the
performance gap between the fast processor and the slow main memory. In particular,
the instruction cache, which is employed by most embedded systems, is one of the foremost
power-consuming and performance-determining microarchitectural features, as instructions
are fetched almost every clock cycle. Thus, careful tuning and optimization of
the instruction cache can lead to significant performance gains and energy savings.
The objective of this thesis is to exploit application characteristics for instruction
cache optimizations. The application characteristics we use include branch probability,
loop bound, temporal reuse profile and intermediate blocks profile. These application
characteristics are identified through profiling and exploited by our subsequent analyti-
cal approach. We consider both hardware and software solutions.
The first part of the thesis focuses on hardware optimization: identifying the best
cache configurations to match the specific temporal and spatial localities of a given
application through an analytical approach. We first develop a static program analysis to
accurately model the cache behavior of a specific cache configuration. Then, we extend
our analysis by taking the structural relations among related cache configurations
into account. Our analysis can estimate the cache hit rates for a set of cache configurations
with varying number of sets and associativity in one pass, as long as the cache
line size remains constant. The input to our analysis is simply the branch probabilities
and loop bounds, which are significantly more compact than the memory address
traces required by trace-driven simulators and other trace-based analytical works.
The second part of the thesis focuses on software optimizations. We propose tech-
niques to tailor the program to the underlying instruction cache parameters. First, we
develop a framework to improve the average-case program performance through static
instruction cache locking. We introduce the temporal reuse profile to accurately and efficiently
model the cost and benefit of locking memory blocks in the cache. We propose
two cache locking algorithms: an optimal algorithm based on branch-and-bound search
and a heuristic approach. Second, we propose an efficient algorithm to place procedures
in memory for a specific cache configuration such that cache conflicts are minimized.
As a result, both performance and energy consumption are improved. Our efficient al-
gorithm is based on intermediate blocks profile that accurately but compactly models
cost-benefit of procedure placement for both direct mapped and set associative caches.
Finally, we propose an integrated instruction cache optimization framework by com-
bining all the techniques together.
List of Publications
• Cache Modeling in Probabilistic Execution Time Analysis. Yun Liang and Tulika Mitra.
45th ACM/IEEE Design Automation Conference (DAC), June 2008.
• Cache-aware Optimization of BAN Applications. Yun Liang, Lei Ju, Samarjit Chakraborty,
Tulika Mitra, Abhik Roychoudhury. ACM International Conference on Hardware/Software
Codesign and System Synthesis (CODES + ISSS), October 2008.
• Static Analysis for Fast and Accurate Design Space Exploration of Caches. Yun Liang,
Tulika Mitra. ACM International Conference on Hardware/Software Codesign and System
Synthesis (CODES + ISSS), October 2008.
• Instruction Cache Locking using Temporal Reuse Profile. Yun Liang and Tulika Mitra.
47th ACM/IEEE Design Automation Conference (DAC), June 2010.
• Instruction Cache Exploration and Optimization for Embedded Systems. Yun Liang. 13th
Annual ACM SIGDA Ph.D. Forum at Design Automation Conference (DAC), June 2010.
• Improved Procedure Placement for Set Associative Caches. Yun Liang and Tulika Mi-
tra. International Conference on Compilers, Architecture, and Synthesis for Embedded
Systems (CASES), October 2010.
List of Tables
4.1 Benchmarks characteristics and runtime comparison of Dinero and our analysis. 47
5.1 Runtime comparison of Cheetah simulator and our analysis. Simulation time
is shown in Column Cheetah. Ratio is defined as Cheetah / Single Pass Analysis . . . . . . 74
6.1 Characteristics of benchmarks. . . . . . . . . . . . . . . . . . . . . . . . 95
7.1 Characteristics of benchmarks. . . . . . . . . . . . . . . . . . . . . . . . 128
7.2 Cache misses of different code layouts running on different cache configurations. 138
List of Figures
2.1 Cache architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 Annotated control flow graph. Each basic block is annotated with its execu-
tion count. Each edge is associated with its execution count and frequency
(probability). For example, the execution count of basic block B2 is 40 and
the execution count of edge B2 → B4 is 40 too. The edge (B2 → B4)
probability is 0.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Control flow graph consists of two paths with equal probability (0.5). The
illustration is for a fully-associative cache with 4 blocks starting with empty
cache state. m0–m4 are the memory blocks. Two probabilistic cache states
before B4 are shown. The probabilistic cache states merging and update oper-
ation are shown for B4. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Analysis of whole program. . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Top-down cache hierarchy analysis . . . . . . . . . . . . . . . . . . . . . 45
4.5 The estimation vs simulation of cache hit rate across 20 configurations. . . . . 49
4.6 Cache set convergence for different values of associativity. . . . . . . . . . . 50
4.7 The estimation vs simulation of cache hit rate across 20 configurations. Esti-
mation is based on the profiles of an input different from simulation input. . . 51
4.8 Performance-energy design space and pareto-optimal points for both simula-
tion and estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1 Cache content and construction of generalized binomial forest. Memory blocks
are represented by tags and set number, for example, for memory block 11(00),
00 denotes the set and 11 is the tag. . . . . . . . . . . . . . . . . . . . . . 60
5.2 Mapping from GBT to array. The nodes in GBT are annotated with their ranks. 62
5.3 Concatenation for GBTs where M = 1 and N = 2. . . . . . . . . . . . . . 66
5.4 Probabilistic GBT combination and concatenation. . . . . . . . . . . . . . . 67
5.5 Pruning in probabilistic GBT. . . . . . . . . . . . . . . . . . . . . . . . . 69
5.6 Estimation vs simulation across 20 configurations. . . . . . . . . . . . . . . 72
5.7 Estimation vs simulation across 20 configurations. Estimation is based on the
profiles of an input different from simulation input. . . . . . . . . . . . . . . 73
6.1 Temporal reuse profiles from a sequence of memory access for a 2-way set
associative cache. Memory blocks m0,m1 and m2 are mapped to the same
set. Cache hits and misses are highlighted. . . . . . . . . . . . . . . . . . . 83
6.2 TRP size across different cache configurations. . . . . . . . . . . . . . . . . 97
6.3 Miss rate improvement (percentage) over cache without locking for various
cache configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Execution time improvement (percentage) over cache without locking for var-
ious cache configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.5 Energy consumption improvement (percentage) over cache without locking
for various cache configurations. . . . . . . . . . . . . . . . . . . . . . . . 102
6.6 Cache miss rate improvement comparison of heuristic and optimal algorithm
for 2-way set associative cache. . . . . . . . . . . . . . . . . . . . . . . . 103
6.7 Average cache miss rate improvement comparison. . . . . . . . . . . . . . . 104
6.8 Miss rate improvement (percentage) over cache without locking for various
cache configurations for FIFO replacement policy. . . . . . . . . . . . . . 108
6.9 Procedure placement (TPCM) vs Cache locking. Cache size is 8K. . . . . . . 109
7.1 Memory address mapping. The address is byte address and line size is as-
sumed to be 2 bytes (last bit). . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Procedure block trace and intermediate blocks profile. Block (line) size is
assumed to be 1 byte. The number of cache sets is assumed to be 2. . . . . . . 118
7.3 CJpeg address trace vs IBP for various inputs with different sizes. . . . . . . 129
7.4 Cache miss rate improvement and code size expansion compared to original
code layout for 4K direct mapped cache. . . . . . . . . . . . . . . . . . . . 130
7.5 Cache miss rate improvement and code size expansion compared to original
code layout for 8K direct mapped cache. . . . . . . . . . . . . . . . . . . . 131
7.6 Cache miss rate improvement compared to original code layout for set asso-
ciative cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.7 Execution time improvement compared to original code layout. . . . . . . . . 134
7.8 Energy reduction compared to original code layout. . . . . . . . . . . . . . 136
7.9 Cache miss rate improvement of IBP over original code layout for FIFO re-
placement policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.10 Average cache miss rate improvement comparison. . . . . . . . . . . . . . . 140
8.1 Integrated instruction cache optimization flow. . . . . . . . . . . . . . . . . 142
8.2 Cache miss rate improvement of integrated instruction cache optimizations.
Baseline cache configuration is a direct mapped cache. Step 1: Design Space
Exploration (DSE); Step 2: Procedure Placement (Layout); Step 3: Instruction
Cache Locking (Locking). . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Chapter 1
Introduction
1.1 Embedded System Design
Embedded systems are application-specific systems that execute one or a few dedi-
cated applications, e.g., multimedia, sensor networks, automotive, and others. Hence,
the particular application running on the embedded processors is known a priori. The
application-specific nature of embedded systems opens up opportunities for embedded
system designers to perform architecture customizations and software optimizations
to suit the needs of the given applications. Such optimization opportunities are not
available for general purpose computing systems, which
are designed for good average performance over a set of typical programs that cover
a wide range of applications with various behaviors, so the actual workload of such
systems is unknown. In contrast, embedded systems implement one or a fixed set of
applications, and their application characteristics can be exploited in embedded system design.
This leads to various novel optimization opportunities involving both architecture and
compilation perspectives, such as application specific instruction set design, application
specific memory architecture and architecture aware compilation flow.
Another characteristic of embedded system design is the great variety of design con-
straints to meet. Design constraints include real-time performance (e.g., both average
and worst case), hardware area, code size, etc. More importantly, embedded systems
are widely used in low power or battery operated devices such as cellular phones. As a
result, energy consumption is one indispensable design constraint.
Using application characteristics, both architecture and software optimizations aim
to optimize the system to meet various design constraints. The customization opportu-
nities of application-specific embedded systems arise from the flexibility of the underly-
ing architecture itself. Modern embedded systems feature parameterizable architectural
features, e.g., functional units and cache. Thus, from a hardware perspective, various
architecture parameters can be tuned or customized. Hence, one challenging task of
embedded system design is to select the best parameters for the application from the
vast number of system parameters. Therefore, the embedded system designers need
fast design space exploration tools with accurate system analysis capabilities to explore
various design alternatives that meet the expected goals. Customized processors, in
turn, need sophisticated compiler technology to generate efficient code suitable for the
underlying architecture parameters. From a software perspective, the compiler can tailor
the program to the specific architecture.
1.2 Memory Optimization for Embedded System
Memory system design has always been a crucial problem in embedded system design,
because system-level performance and energy consumption depend strongly on
the memory system. The cache memory subsystem is particularly important in embedded
system design as it bridges the performance gap between the fast processor and the
slow main memory. Generally, for a well-tuned and optimized memory hierarchy, most
memory accesses can be served directly from the cache instead of main memory,
which consumes more power and incurs a longer delay per access. In this thesis,
we focus on the instruction cache, which is present in almost all embedded systems. The
instruction cache is one of the foremost power-consuming and performance-determining
microarchitectural features of modern embedded systems, as instructions are fetched
almost every clock cycle. For example, instruction fetch consumes 22.2% of the power
in the Intel Pentium Pro processor [23], and 27% of the total power is spent by the instruction
cache in the StrongARM 110 processor [70]. Thus, careful tuning and optimization of
the instruction cache can lead to significant performance gains and energy savings.
Instruction cache performance can be improved via hardware (architecture) means
and software means. From an architectural perspective, caches can be customized for
the specific temporal and spatial localities of a given application. Caches can be configured
statically or dynamically. For statically configurable caches [3, 5, 8, 9], the system
designer can set the cache's parameters in a synthesis tool, generating a customized
cache. Dynamically configurable caches [106, 10, 15] can be controlled by
software-configurable registers such that the cache parameters can be varied dynamically.
From a software perspective, the program can be tailored for the specific cache
architecture. Cache-aware program transformations allow the modified application to
utilize the underlying cache more efficiently.
For architecture customization, the system designer can choose an on-chip cache
configuration that is suited for a particular application and customize the caches for
it. However, the cache design parameters include the size of the cache, the line size,
the degree of associativity, the replacement policy, and many others. Hence, the cache design
space consists of a large number of design points. The most popular approach
to exploring the cache design space is to employ trace-driven simulation or functional
simulation [95, 59, 56, 106]. Although the resulting cache hit/miss rates are accurate, the
simulation is slow, typically taking much longer than the execution time of the program.
Moreover, the address trace tends to be large even for a small program. Thus, huge trace
sizes put a practical limit on the size of the application and its input. In this thesis, we
explore analytical modeling as an alternative to simulation for fast and accurate estimation
of cache hit rates. Analytical design space exploration can help the system designer
explore the search space quickly and come up with a set of promising configurations
along multiple dimensions (i.e., performance and energy consumption) in the early design
stage. However, due to demanding design constraints, the set of promising
configurations chosen from design space exploration may not always meet the design
objectives, or the size of the cache returned from design space exploration may be too
big. Hence, we also consider software-based instruction cache optimization techniques
to further improve performance.
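To give a feel for how quickly the cache design space grows, the following Python sketch enumerates design points over the four parameters named above. The parameter ranges are hypothetical and chosen only for illustration; they are not the ranges evaluated in this thesis.

```python
from itertools import product

# Hypothetical parameter ranges, chosen only for illustration.
LINE_SIZES = [8, 16, 32, 64]               # line size L, in bytes
NUM_SETS = [2 ** k for k in range(11)]     # number of sets K = 1, 2, ..., 1024
ASSOCIATIVITIES = [1, 2, 4, 8, 16]         # associativity A
POLICIES = ["LRU", "FIFO"]                 # replacement policy


def design_space(max_size_bytes=None):
    """Enumerate (L, K, A, policy) design points, optionally capped by cache size."""
    points = []
    for L, K, A, policy in product(LINE_SIZES, NUM_SETS, ASSOCIATIVITIES, POLICIES):
        # Cache size is K * A * L bytes; skip points above the optional cap.
        if max_size_bytes is not None and K * A * L > max_size_bytes:
            continue
        points.append((L, K, A, policy))
    return points


# For A = 1 the replacement policy has no effect, so the raw count below
# slightly overstates the number of behaviourally distinct caches.
all_points = design_space()
print(len(all_points))  # 4 * 11 * 5 * 2 = 440 design points
```

Even with these modest ranges there are several hundred candidate configurations, which is why evaluating each one with a separate simulation run becomes expensive.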
For software solutions, since the underlying instruction cache parameters are known,
the program code can be appropriately tailored for the specific cache architecture. More
concretely, for software optimizations, we consider cache locking and procedure place-
ment. Most modern embedded processors (e.g., ARM Cortex series processors) feature
cache locking mechanisms whereby one or more cache blocks can be locked under
software control using special lock instructions. Once a memory block is locked in
the cache, it cannot be evicted from the cache under replacement policy. Thus, all
the subsequent accesses to the locked memory blocks will be cache hits. However,
most existing cache locking techniques are proposed for improving the predictability of
hard real-time systems. Using cache locking to improve the performance of general
embedded systems has not been explored. We observe that cache locking can be quite effective
in improving the average-case execution time of general embedded applications as
well. We propose a precise cache modeling technique to model the cost and benefit of
cache locking and efficient algorithms for selecting memory blocks for locking. Procedure
placement is a popular technique that aims to improve the instruction cache hit rate
by reducing conflicts in the cache through compile/link time reordering of procedures.
However, existing procedure placement techniques make reordering decisions based
on imprecise conflict information. This imprecision leads to limited and sometimes
negative performance gain, especially for set-associative caches. We propose a precise
modeling technique to model the cost and benefit of procedure placement for both direct
mapped and set associative caches. Then we develop an efficient algorithm to place
procedures in memory such that cache conflicts are minimized.
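The cost and benefit of cache locking described above can be illustrated with a small simulation of a single cache set. This is a minimal sketch, not the modeling technique proposed in this thesis. It assumes LRU replacement among the unlocked ways, and it assumes that locked blocks are preloaded before execution so that every access to them hits. The function name `simulate_set` and the block names are hypothetical.

```python
from collections import OrderedDict


def simulate_set(trace, associativity, locked=()):
    """Return (hits, misses) for one cache set under LRU with static locking.

    Assumption: locked blocks are preloaded, so all accesses to them hit.
    Each locked block permanently occupies one way of the set.
    """
    locked = set(locked)
    if len(locked) > associativity:
        raise ValueError("cannot lock more blocks than there are ways")
    free_ways = associativity - len(locked)
    lru = OrderedDict()  # unlocked blocks, least recently used first
    hits = misses = 0
    for block in trace:
        if block in locked:
            hits += 1                        # locked blocks are never evicted
        elif block in lru:
            hits += 1
            lru.move_to_end(block)           # mark as most recently used
        else:
            misses += 1
            if free_ways > 0:
                if len(lru) == free_ways:
                    lru.popitem(last=False)  # evict the least recently used block
                lru[block] = None
            # With no free ways, unlocked blocks bypass the cache entirely.
    return hits, misses


# Three blocks competing for a 2-way set: LRU thrashes without locking.
cyclic = ["m0", "m1", "m2"] * 4
print(simulate_set(cyclic, 2))                  # no locking
print(simulate_set(cyclic, 2, locked={"m0"}))   # m0 locked

# Locking has a cost: a block that is never accessed still occupies a way.
pair = ["m1", "m2"] * 3
print(simulate_set(pair, 2))
print(simulate_set(pair, 2, locked={"m0"}))
```

In the first trace, locking m0 turns its accesses into hits; in the second, the same locking decision removes a way that m1 and m2 needed. A locking algorithm has to weigh both effects when choosing blocks.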
Obviously, the ideal customized cache configurations and the software optimiza-
tion solution are determined by the characteristics of the application. The application
characteristics we use in this thesis include basic block execution count profile (branch
probability, loop bound), temporal reuse profile and intermediate blocks profile. All
these application characteristics can be easily collected through profiling. More impor-
tantly, most of these application characteristics are architecture (cache configurations)
independent. Hence, they only need to be collected once. After these application char-
acteristics are collected, they will be utilized by our subsequent analysis to derive the
optimal cache configurations and optimization solutions.
1.3 Thesis Contributions
In this thesis, we study the instruction cache optimizations for embedded systems. Our
goal is to tune and optimize the instruction cache by utilizing application characteristics for
better performance and lower power consumption. Specifically, in this thesis we make the
following contributions.
• Cache Modeling via Static Program Analysis. We develop a static program
analysis technique to accurately model the cache behavior of an application on
a specific cache configuration. We introduce the concept of probabilistic cache
states, which captures the set of possible cache states at a program point along
with their probabilities. We also define operators for update and concatenation of
probabilistic cache states. Then, we propose a static program analysis technique
that computes the probabilistic cache states at each point of program control flow
graph (CFG), given the program branch probability and loop bound information.
With the computed probabilistic cache states, we are able to derive the cache hit
rate for each memory reference in the CFG and the cache hit rate for the entire
program. Furthermore, modern embedded systems’ memory hierarchy consists of
multiple levels of caches. We extend our static program analysis for caches with
hierarchies too. Experiments indicate that our static program analysis achieves
high accuracy [63].
• Design Space Exploration of Caches. We present an analytical approach for
exploring the cache design space. Although the technique we propose in [63] is
a fast and accurate static program analysis that estimates cache hit rate of a pro-
gram for a specific configuration, it does not solve the problem of design space
exploration due to the vast number of cache configurations in the cache design space.
Fortunately, there exist structural relations among related cache configurations
[90]. Based on this observation, we extend our analytical approach to model
multiple cache configurations in one pass in Chapter 5. More precisely, our analysis
method can estimate the hit rates for a set of cache configurations with varying
number of cache sets and associativity in one pass as long as the cache line size
remains constant. The input to our analysis is simply the branch probability and
loop bounds, which is significantly more compact compared to memory address
traces required by trace-driven simulators and other trace based analytical works.
We show that our technique is highly accurate and is 24 - 3,855 times faster
compared to the fastest known single-pass cache simulator Cheetah [64].
• Cache Locking. We develop a framework to improve the average-case program
performance through static instruction cache locking. We introduce temporal
reuse profile (TRP) to accurately and efficiently model the cost and benefit of
locking memory blocks in the cache. TRP is significantly more compact com-
pared to memory traces. We propose two cache locking algorithms based on
TRP: an optimal algorithm based on branch-and-bound search and a heuristic
approach. Experiments indicate that our cache locking heuristic improves the
state of the art in terms of both performance and efficiency and achieves close to
the optimal result [62]. We also compare cache locking with a complementary
instruction cache optimization technique called procedure placement. We show
that procedure placement followed by cache locking can be an effective strategy
in enhancing the instruction cache performance significantly [62].
• Procedure Placement. We propose an efficient algorithm to place procedures
in memory for a specific cache configuration such that cache conflicts are mini-
mized. As a result, both the performance and energy consumption are improved.
Our efficient procedure placement algorithm is based on intermediate blocks pro-
file (IBP) that accurately but compactly models cost-benefit of procedure place-
ment for both direct mapped and set associative caches. Experimental results
demonstrate that our approach provides substantial improvement in cache perfor-
mance over existing procedure placement techniques. However, we observe that
the code layout generated for a specific cache configuration is not portable across
platforms with the same instruction set architecture but different cache configurations.
Such a portability issue is very important in situations where the underlying
hardware platform (cache configurations) is unknown. This is true for embedded
systems where the code is downloaded during deployment. Hence, we propose
another procedure placement algorithm that generates a neutral code layout with
good average performance across a set of cache configurations.
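The notion of probabilistic cache states from the first contribution above can be made concrete with a small sketch. The representation here is an assumption made for illustration: a probabilistic cache state is a dictionary that maps each concrete cache state (a tuple of blocks in one LRU-managed set, most recently used first) to its probability. The functions `update` and `merge` are hypothetical stand-ins for an update on a memory access and a merge at a control flow join weighted by branch probabilities; the concatenation operator is not shown, since its definition appears later in the thesis.

```python
from collections import defaultdict


def lru_access(state, block, assoc):
    """Concrete LRU state (most recently used first) after accessing block."""
    rest = tuple(b for b in state if b != block)
    return ((block,) + rest)[:assoc]


def update(pcs, block, assoc):
    """Apply one access to every concrete state, summing equal outcomes."""
    out = defaultdict(float)
    for state, prob in pcs.items():
        out[lru_access(state, block, assoc)] += prob
    return dict(out)


def merge(weighted_states):
    """Merge probabilistic cache states at a join, given (pcs, edge probability) pairs."""
    out = defaultdict(float)
    for pcs, edge_prob in weighted_states:
        for state, prob in pcs.items():
            out[state] += edge_prob * prob
    return dict(out)


def hit_probability(pcs, block):
    """Probability that an access to block hits in the given state."""
    return sum(prob for state, prob in pcs.items() if block in state)


# Hypothetical example: one 2-way set, two equally likely paths.
empty = {(): 1.0}
path_a = update(update(empty, "m0", 2), "m1", 2)   # accesses m0, m1
path_b = update(update(empty, "m0", 2), "m2", 2)   # accesses m0, m2
joined = merge([(path_a, 0.5), (path_b, 0.5)])
print(joined)
print(hit_probability(joined, "m1"))
```

In this example an access to m1 after the join hits only if the first path was taken, so its hit probability equals that path's branch probability.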
1.4 Thesis Organization
The rest of the thesis is organized as follows. Chapter 2 will first lay the foundation
for discussion by introducing the cache mechanism. Chapter 3 surveys the state of the
art techniques related to instruction cache exploration and optimization for embedded
systems. Chapter 4 presents a static program analysis technique to model the cache
behavior of a particular application. Chapter 5 extends the static program analysis in
chapter 4 for efficient instruction cache design space exploration. Chapter 6 discusses
employing cache locking for improving average case execution time for general embed-
ded applications. Chapter 7 presents an improved procedure placement technique for set
associative caches and a procedure placement algorithm for a neutral layout with good
portability. Chapter 8 describes a systematic instruction cache optimization flow that integrates
all the techniques developed in the thesis. Finally, we conclude the thesis
with a summary of contributions and examine possible future directions in Chapter 9.
Chapter 2
Background
In this chapter, we will look into the details of cache mechanisms including cache ar-
chitecture, various cache design parameters and cache locking.
2.1 Cache
A cache is a fast on-chip memory that stores copies of data from off-chip main memory
for faster access [77]. Caches are effective because they take advantage of the principle of
locality: programs tend to reuse data and instructions they have used recently. There
are two types of locality. Temporal locality states that recently accessed items are
likely to be accessed again in the near future. Spatial locality states that items whose
addresses are near one another tend to be referenced close together in time [51].
Cache Terminology. A cache memory is defined in terms of four major parameters:
block or line size L, number of sets K, associativity A, and replacement policy. The
block or line size determines the unit of transfer between the main memory and the
cache. A cache is divided into K sets. Each cache set, in turn, is divided into A cache
blocks, where A is the associativity of the cache. For a direct-mapped cache A = 1, for
a set-associative cache A > 1, and for a fully associative cache K = 1. In other words,
a direct-mapped cache has only one cache block per set, whereas a fully associative
cache has only one cache set. The cache size is defined as (K × A × L). A memory
block m can be mapped to only one cache set, given by (m modulo K). For a
set-associative cache, the replacement policy (e.g., LRU, FIFO, etc.) defines the block to
be evicted when a cache set is full.
[Figure: a memory address split into tag, index and offset fields; K cache sets, each
with A (associativity) entries consisting of a valid bit, a tag and data.]
Figure 2.1: Cache architecture.
Cache architecture is shown in Figure 2.1. As shown, each cache way (one per unit of
associativity) consists of K cache lines. Each cache line consists of three parts:
a data portion, which contains the memory block; a tag portion, which is used to differentiate
the possible memory blocks mapped to the same cache set; and a valid bit, which is used to
indicate whether or not this entry contains a valid address. Given a memory address
reference, the address is divided into three fields as shown in Figure 2.1. The index
field determines the cache set to which this address is mapped; the tag field is used to
determine whether the referenced address is contained in the cache (true if the tag field
matches the tag portion of the corresponding line); and the offset field is used to select
the desired data from the cache line or block. When the cache receives an address from
the processor, all A cache ways are searched simultaneously and the address
reference is a cache hit if the requested address is found in one of the A cache ways. If
a cache miss happens, the address reference is directed to the main memory and
the memory block fetched from the main memory is placed in the cache.
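As an illustration of the lookup just described, the following Python sketch splits an address into its tag, index and offset fields and models a set-associative cache with LRU replacement. It is a minimal sketch under stated assumptions, not the simulator used in this thesis: the names `split_address` and `SetAssociativeCache` are hypothetical, addresses are byte addresses, and the valid bit is implicit (a tag is present in a set only if its line is valid).

```python
from collections import OrderedDict


def split_address(addr, line_size, num_sets):
    """Split a byte address into (tag, index, offset) fields."""
    offset = addr % line_size      # byte within the cache line
    block = addr // line_size      # memory block number m
    index = block % num_sets       # cache set, m modulo K
    tag = block // num_sets        # distinguishes blocks that share a set
    return tag, index, offset


class SetAssociativeCache:
    """K sets of A lines each (hypothetical model), LRU replacement per set."""

    def __init__(self, line_size, num_sets, assoc):
        self.line_size = line_size
        self.num_sets = num_sets
        self.assoc = assoc
        # One ordered dict of tags per set; the last entry is the most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def size_bytes(self):
        # Cache size = K x A x L
        return self.num_sets * self.assoc * self.line_size

    def access(self, addr):
        """Return True on a hit; on a miss, load the block and return False."""
        tag, index, _ = split_address(addr, self.line_size, self.num_sets)
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # becomes most recently used
            return True
        if len(lines) == self.assoc:
            lines.popitem(last=False)      # evict the least recently used line
        lines[tag] = True
        return False
```

With L = 16, K = 4 and A = 2, the cache size is 128 bytes, and the addresses 0x00, 0x40 and 0x80 all map to set 0, so accessing them in turn evicts the block holding 0x00.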
2.2 Cache Locking
Most modern embedded processors (e.g., ARM Cortex series processors) feature cache
locking mechanisms whereby one or more cache blocks can be locked under software
control using special lock instructions. Once a memory block is locked in the cache, it
cannot be evicted from the cache by the replacement policy. Thus, all subsequent
accesses to the locked memory blocks will be cache hits. Only when the cache line is
unlocked can the corresponding memory block be replaced. Since accesses to locked
memory blocks are guaranteed to be cache hits, their latencies are constant. Thus, cache
locking is commonly used to improve the timing predictability of hard real-time embedded
systems. Cache locking mechanisms are present in quite a number of modern commercial
processors, for example, the ARM processor series [4, 6], IBM PowerPC 440 [7],
Intel XScale [1], Blackfin 5xx family processors [2] and Freescale e300 [84].
Two locking mechanisms are commonly used in modern embedded processors:
way locking and line locking. In way locking, particular ways of a set-associative cache
are selected for locking and these ways are locked for all the cache sets. Way locking
is employed by the ARM processor series [4, 6]. Compared to way locking, line locking
is a fine-grained locking mechanism. In line locking, a different number of lines can be
locked for different cache sets. Line locking is employed by Intel XScale [1], the ARM9
family and Blackfin 5xx family processors [2].
There are two possible locking schemes — static cache locking and dynamic cache
locking. In static cache locking scheme, the selected memory blocks are locked once
before the start of the program and remain locked during the entire execution of the pro-
gram. The additional locking instructions are executed only once. Thus, the overhead
of locking is negligible. In dynamic cache locking scheme, the memory blocks to be
locked can be changed at chosen execution points. Locking instructions are inserted at
appropriate program points for reloading the cache. Certainly, the overhead of reload-
ing in dynamic cache locking scheme is not negligible and has to be taken into account
in the total execution time computation.
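The effect of locking on a single cache set can be sketched as follows. This is an illustrative model only, assuming line locking within one set with LRU replacement for the unlocked lines; the class `LockableSet` and its methods are hypothetical and do not correspond to the lock instructions of any particular processor.

```python
class LockableSet:
    """One cache set with A lines; locked blocks are exempt from LRU eviction."""

    def __init__(self, assoc):
        self.assoc = assoc
        self.locked = set()   # blocks pinned in this set
        self.lru = []         # unlocked blocks, most recently used first

    def lock(self, block):
        """Load and lock a block (stands in for a lock instruction)."""
        if block in self.locked:
            return
        if len(self.locked) == self.assoc:
            raise ValueError("all lines of this set are already locked")
        if block in self.lru:
            self.lru.remove(block)
        self.locked.add(block)
        # Locking a line leaves fewer lines for the unlocked blocks.
        del self.lru[self.assoc - len(self.locked):]

    def unlock(self, block):
        """Unlock a block; it stays cached as the most recently used line."""
        if block in self.locked:
            self.locked.remove(block)
            self.lru.insert(0, block)

    def access(self, block):
        """Return True on a hit, False on a miss."""
        if block in self.locked:
            return True                     # locked blocks always hit
        hit = block in self.lru
        if hit:
            self.lru.remove(block)
        self.lru.insert(0, block)
        del self.lru[self.assoc - len(self.locked):]
        return hit
```

For the cyclic access sequence a, b, c, a, b, c on a 2-way set, plain LRU yields no hits, whereas locking a once before the sequence starts (static locking) turns both accesses to a into hits. Dynamic locking would correspond to calling `lock` and `unlock` at chosen points of the sequence, with the reload cost charged at each such point.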
Chapter 3
Literature Review
In this chapter, we present an overview of existing research on application specific em-
bedded system design with emphasis on memory subsystems, design space exploration
of caches and instruction cache optimization techniques including cache locking and
code layout reorganization.
3.1 Application Specific Memory Optimization
The optimization techniques proposed by computer architecture and compiler commu-
nity for general purpose computing systems are still beneficial to embedded systems.
More importantly, application characteristics and architectural flexibility open up a new
dimension to explore for embedded systems. Application specific memory customiza-
tions and optimizations typically incorporate and utilize application characteristics so
as to achieve power and performance improvements. This leads to various novel archi-
tectures and compilation optimizations such as application specific memory hierarchy
design and architecture aware compilation.
In the last decade, optimizing cache memory design for embedded systems has
received a lot of attention from the research community [75, 106, 10, 15, 86, 104, 69, 59,
60, 88, 18, 78, 19, 96]. In this thesis, we focus on design space exploration of caches,
that is, determining the best instruction cache parameters from the vast number of cache
configurations for a given application, and on two software optimizations: instruction
cache locking and procedure placement.
3.2 Design Space Exploration of Caches
One of the most effective cache optimizations is to tune cache parameters for the spe-
cific application. The tuning process is done through cache design space exploration.
More concretely, for an application-specific embedded system, we can choose a specific
cache configuration from the huge cache design space to meet the design constraints
(i.e., performance, energy and hardware area) required by the specific application.
Furthermore, all the analytical performance and energy models need the cache hits/misses
of a cache configuration as inputs [59, 86, 106] to predict the performance and energy
consumption. To obtain the cache hits/misses for each cache configuration, we can rely
on detailed trace-driven simulation, analytical modeling, or a hybrid approach using both
simulation and analytical modeling.
3.2.1 Trace Driven Simulation
Trace-driven simulation is widely used for evaluating cache design parameters [95].
The collected application trace is fed to the cache simulator which mimics the behavior
of some hypothetical cache configurations and outputs the cache performance metrics
such as cache hit/miss rate. However, complete trace simulation could be very slow
and sometimes is not necessary. Hence, lossless trace reduction techniques have been
described in [100, 103]. Wang and Baer observed that the references that hit in small
direct mapped cache will hit in larger caches. They exploited the observation to remove
certain references from the trace before simulation [100]. In [103], cache configura-
tions are simulated in a particular order in order to strip off some redundant informa-
tion from the trace after each simulation. However, both of these techniques still need
multiple passes of simulation. Single pass simulation techniques have been proposed
in [90, 52, 68]. Based on the inclusion property that roughly states that the content
of a smaller cache is included in a bigger cache for certain replacement policy, multi-
ple cache configurations can be evaluated simultaneously during a single pass. Various
data structures, such as single stack [68], forest [52], and generalized binomial tree [90],
have been proposed for utilizing the inclusion property. Cheetah [90] is shown to be the
most efficient single pass simulator so far. However, address traces could be very big
even for a small program and they have to be compressed for practical usage.
Simulation methodologies that operate directly on a compressed trace have been presented
in [54, 57]. Recently, Mohammad et al. proposed a fast simulation framework, SuSeSim,
to find the optimal L1 cache configuration for embedded systems [47]. SuSeSim
is a single pass multiple cache configurations analysis tool. However, it is not clear how
fast SuSeSim is compared to Cheetah, the fastest single pass multiple configurations
simulator, and an address trace is still needed for their technique. Our analytical model in
chapter 5 is shown to be much faster than Cheetah and does not need an address trace.
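To make the inclusion property concrete, the sketch below evaluates fully associative LRU caches of every size from 1 to a chosen maximum in a single pass over a trace of memory block numbers, using one LRU stack. It illustrates the general stack-simulation idea only and is not the implementation of Cheetah or of any of the cited simulators; the function name `lru_hits_for_all_sizes` is hypothetical.

```python
def lru_hits_for_all_sizes(trace, max_blocks):
    """One pass over `trace`; returns {a: hits} for fully associative
    LRU caches holding a = 1..max_blocks blocks."""
    stack = []                            # most recently used block first
    depth_hist = [0] * (max_blocks + 1)   # depth_hist[d]: accesses found at depth d
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1
            if depth <= max_blocks:
                depth_hist[depth] += 1
            stack.remove(block)
        stack.insert(0, block)
    # Inclusion: an access found at depth d hits in every cache with >= d blocks.
    hits, running = {}, 0
    for a in range(1, max_blocks + 1):
        running += depth_hist[a]
        hits[a] = running
    return hits
```

For the trace 0, 1, 2, 0, 1, 2, 0, 3, 0 the sketch reports 0, 1, 5 and 5 hits for caches of 1, 2, 3 and 4 blocks, which matches simulating each size separately.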
3.2.2 Analytical Modeling
Analytical modeling has been proposed as an alternative to trace-driven simulation.
Cascaval and Padua [26] described an analytical model for estimating cache perfor-
mance based on stack distance. Stack distance accurately models fully associative
caches with LRU replacement policy. However, the accuracy could be significantly
lower for set-associative caches as shown in [26]. Harper et al. [49] proposed an analytical
model for set-associative caches. Their model, applicable to numerical codes
mainly consisting of array operations, can predict the cache miss rate through an exten-
sive hierarchy of cache reuse, interference effects and numerous forms of temporal and
spatial locality. There are works that estimate data cache behavior by formulating math-
ematical equations [36, 27]. All the aforementioned analytical approaches are restricted
to the applications without data dependent conditionals and indirections. Given an ad-
dress trace, [83, 20] proposed probability based analytical models to compute cache
hit rate. But their approaches are mainly for direct mapped caches. More importantly,
all the above analytical models focus on performance estimation and optimization of
a specific cache configuration. Thus, they do not solve the problem of design space
exploration of caches.
Only a few approaches use analytical modeling to perform design space exploration
for embedded systems [35, 34, 74, 38]. Panda et al. [74]
first presented an analytical strategy for exploring the on-chip memory architecture
for a given application. Their analytical model could quickly determine a combination
of scratch-pad memory and data cache, and the appropriate line size, based on the
analysis of the given application. However, the data cache is limited to a direct-mapped cache
and the memory accesses have to be regular array accesses. Givargis et al. presented
an analytical system-level exploration approach for Pareto-optimal configurations in
parameterized embedded systems [38]. However, for the memory subsystem, it is based on an
exhaustive search using simulations. Ghosh and Givargis [35, 34] proposed an efficient
analytical approach for design space exploration of caches. Given the application trace
and the desired performance constraint, the analytical model directly generates the set of
cache configurations that meet the performance constraint. However, as described
in [35, 34], for realistic cache design parameters (limited associativity), the proposed
analytical model is as slow as trace simulation.
3.2.3 Hybrid Approach
Hybrid approaches are used to explore both single-level and multi-level cache design
spaces [43, 44, 87, 37, 73]. In all hybrid approaches, simulations are employed to obtain the cache
hits/misses for only a subset of the cache design space. Then, various heuristics are used
to predict the cache hits/misses of other design points or to prune the exploration search
space. Givargis et al. presented an exploration technique for a parameterized cache and
bus together [37]. In their technique, some cache performance data are collected via
simulation first. Then, simple equations are used to predict the performance of other
configurations. Palesi and Givargis [73] applied Genetic Algorithms (GA) for design
space exploration to discover Pareto-optimal configurations representing design objective
tradeoffs (e.g., performance and power). Simulations are still needed in the evolution
process of the GA to derive the objective values of a configuration. Gordon-Ross et
al. developed heuristics for exploring a second level cache with separate and unified
instruction and data caches [43, 44]. Again, simulations are needed. Dynamic cache tuning
relying on a dynamically adjustable configurable cache is proposed in [42]. Selecting a
subset of cache configurations from the huge cache design space for effective cache tuning
is described in [99].
All the hybrid techniques are complementary to our techniques in chapter 5 since
they can prune the design space efficiently and our methods can estimate the performance
of cache configurations accurately and efficiently. However, in most cases, the
cache design space is pruned based on an obvious fact: big caches perform better
than small caches. Given cache configurations with a fixed size but different associativity
and number of cache sets, the heuristics proposed in hybrid approaches may not be
effective because there is no straightforward correlation among these configurations.
In contrast, our technique in chapter 5 is still fast because it captures the structural
relations among the configurations. Finally, hybrid approaches may be slow as well because
simulations are still needed.
3.3 Cache Locking
Cache locking was primarily designed to offer better timing predictability for hard real-
time applications. Hence, the compiler optimization techniques focus on employing
cache locking to improve worst-case execution time. However, cache locking can be
quite effective in improving the average-case execution time of general embedded ap-
plications as well. In the following, we will summarize the techniques of employing
cache locking for improving the timing predictability for hard real-time systems and
the average-case performance of general embedded systems.
3.3.1 Hard Real-time Systems
Instruction cache locking has been employed in hard real-time systems for better timing
predictability [80, 25, 30, 66]. In hard real-time systems, worst case execution time
(WCET) is an essential input to the schedulability analysis of multi-tasking real-time
systems. It is difficult to estimate a safe but tight WCET in the presence of complex
micro-architectural features such as caches. By statically locking instructions in the
cache, WCET becomes more predictable. Puaut and Decotigny proposed two
low-complexity algorithms for static cache locking in a multi-tasking environment [80].
System utilization or inter-task interferences are minimized through static cache locking [80].
Campoy et al. employed genetic algorithms to select contents for locking
in order to minimize system utilization [25]. However, the WCET path may change
after some functions are locked into the instruction cache and the change of the WCET
path is not handled in [80, 25]. Falk et al. considered the change of the WCET path
and showed better WCET reduction [30]. All the techniques [80, 25, 30] are
heuristic-based approaches. Liu et al. [66] formulated instruction cache locking for
minimizing WCET as a linear programming model and showed that the problem is NP-hard.
In addition, for a subset of programs with certain properties, polynomial time
optimal solutions are developed in [66]. Locking has also been applied to shared caches
in multi-core environments in [92].
Data cache locking algorithms for WCET minimization are presented in [97, 98].
Based on the extended reuse vector analysis [102], cache miss equations [36] are
formulated to find those data reuses that translate to cache misses. For those data reuses that
cannot be analyzed statically due to data dependencies, heuristics that lock frequent
data accesses are used. However, in [97, 98], the WCET path is not considered.
3.3.2 General Embedded Systems
Cache locking can be quite effective for improving average-case execution time for
general embedded applications as well. A data cache locking mechanism based on the
length of the reference window for each data access instruction is proposed in [105].
However, it does not model the cost/benefit of locking and there is no guarantee of
performance improvement. Recently, Anand and Barua proposed an instruction cache
locking algorithm for improving average-case execution time in [12]. However, their
technique has two main disadvantages. First, Anand and Barua's approach
relies on trace-driven simulation to evaluate the cost and benefit of cache locking.
Trace-driven simulation could be very slow, typically taking longer than the execution time of
the program [95]. More importantly, in Anand and Barua's method, two detailed trace
simulations are employed in each iteration, where one iteration locks one memory block
in the cache. Such extensive usage of simulation is not feasible for large programs or
large caches. Secondly, in their method, the cache locking benefit is approximated by
locking dummy blocks to keep the number of simulations reasonable. Thus, the cost and
benefit of cache locking are not precisely calculated in [12].
In chapter 6, we introduce temporal reuse profile to model cache behavior. Pre-
viously, reuse distance has been proposed for the same purpose [21, 28, 22]. Reuse
distance is defined as the number of distinct data accesses between two consecutive
references to the same address and it accurately models the cache behavior of a fully
associative cache. However, to precisely model the effect of cache locking, we need the
content instead of the number (size) of the distinct data accesses between two consecu-
tive references. Temporal reuse profile in chapter 6 records both the reuse content and
their frequencies.
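The difference between recording the number and recording the content of the intervening accesses can be illustrated with a short sketch. It is only an illustration of the distinction drawn above, assuming a trace of block identifiers; it is not the temporal reuse profile construction of chapter 6, and the function names `reuse_profile` and `reuse_distances` are hypothetical.

```python
from collections import Counter


def reuse_profile(trace):
    """For every re-reference, record the set of distinct blocks accessed
    since the previous reference to the same block, with its frequency."""
    last_pos = {}
    profile = {}   # block -> Counter mapping frozenset(content) -> frequency
    for pos, block in enumerate(trace):
        if block in last_pos:
            content = frozenset(trace[last_pos[block] + 1:pos])
            profile.setdefault(block, Counter())[content] += 1
        last_pos[block] = pos
    return profile


def reuse_distances(profile, block):
    """Collapse reuse contents to their sizes, i.e. to reuse distances."""
    dist = Counter()
    for content, freq in profile.get(block, {}).items():
        dist[len(content)] += freq
    return dist
```

For the trace a, b, c, a, b, b, a, the two reuses of a have contents {b, c} and {b}, i.e. reuse distances 2 and 1. Only the content reveals which blocks would have to be locked, or left unlocked, to change the outcome of a given reuse.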
3.4 Code Layout
Reorganizing instructions in memory to improve instruction cache performance has
been studied for more than a decade. Code rearrangement techniques operate at either
the basic block level or the procedure level.
Basic block level placement techniques usually form sequences of basic blocks that
tend to execute together frequently according to profiling information and place them
together. Tomiyama and Yasuura proposed an ILP solution to find the optimal place-
ment and a refined method aiming to reduce code size [94]. Parameswaran and Henkel
described a fast instruction code placement heuristic to reduce cache misses for perfor-
mance and energy in [76]. Compared to procedure placement, basic block level place-
ment usually gives better results due to its fine granularity. However, basic block level
placement involves modifying the application assembly code by inserting additional
instructions (i.e., jump instructions).
Earlier procedure placement techniques build a procedure call graph to model the
conflicts among procedures, where the vertices are the procedures and the edges are
weighted by the number of calls between two procedures [53, 79]. The edge weights be-
tween two procedures are used to estimate the cache conflicts between two procedures.
Conflicting procedures will be placed next to each other. As a result, the conflicts due
to overlap among procedures are reduced. However, the underlying cache parameters
are not taken into account. Thus, the code layout generated may not be suitable for a
specific cache configuration.
By taking cache parameters (line size, cache size) into account, an improved
procedure placement technique is proposed in [50]. The algorithm maintains the set of
unavailable cache locations (colors) for each procedure. The colors are used to guide
procedure placement. Later on, the technique is extended to model indirect procedure
calls and uses cache lines instead of whole procedures to model conflicts [55], which
leads to more performance gain. Gloy et al. in [39, 40] built a temporal relationship graph
(i.e., which procedures are referenced between two consecutive accesses to another
procedure). Working on this more detailed graph, and also considering the cache size and line
size, they have shown better results than [53, 79], which neglect the cache parameters. For
all the above techniques [50, 55, 39, 40], the conflict metric is just an approximation
of conflict misses and is designed for direct-mapped caches. Recently, Bartolini and
Prete proposed a precise procedure placement technique [17] using detailed trace-driven
simulation to evaluate the effect of procedure placements. In their work, the number
of simulations required increases linearly with the number of procedures and the cache size.
However, detailed simulation could be extremely slow [95], even if the trace is slightly
compressed. Thus, a simulation-based approach is not feasible for applications of
non-trivial size, long traces, or large caches. In contrast, our technique in chapter 7 is
based on the compact intermediate blocks profile that models the cache accurately and
efficiently.
Existing procedure placement techniques allow gaps among procedures to improve
cache performance. This leads to code size expansion. Although various simple heuristics
have been proposed to reduce the code size in [50, 55, 39, 40, 17], the code size
could still expand significantly as shown in [45]. Thus, the cache performance is improved
at the cost of code size expansion. Such huge code size expansion makes these
techniques unusable in the context of embedded systems. Guillon et al. extend the
technique in [40] to deal with code size. They introduce a parameter to guide the tradeoff
between performance and code size. They also develop a polynomial time optimal
algorithm to minimize the code size. It is shown that good performance is still achieved
but with a small code size expansion [45]. However, the technique developed in [45] is
mainly for direct-mapped caches and the cache misses are modelled using imprecise conflict
information. In addition, none of the existing procedure placement techniques solves
the portability problem: the layout generated for a specific configuration may not be
portable across different cache configurations.
Code reordering techniques have also been used in multi-tasking embedded sys-
tems. The starting address of a task can be changed to minimize the cache conflict
misses among tasks [41, 33, 60]. Procedure placement has been exploited to optimize
worst case execution time too [67].
3.5 Cache Modeling for Timing Analysis
Worst Case Execution Time (WCET) is an essential input for schedulability analysis
of hard real-time systems [82, 93]. Due to the safety-critical nature of hard real-time systems,
the estimation of WCET has to be both safe and tight. In addition, modern complicated
architectural features such as caches make the program execution time difficult to predict
and these features have to be taken into account in order to derive a safe WCET. Thus,
caches have been modelled by the real-time community as well.
Abstract interpretation has been exploited for estimating WCET [11, 32, 31]. Li
et al. [61] modeled the cache using a cache conflict graph and formulated the estimation of
WCET as an Integer Linear Programming (ILP) problem. WCET estimation techniques
based on the categorization of cache accesses are proposed in [71, 13]. Lim et al.
modeled the instruction cache using timing schema in [65]. Data caches are modelled to
derive tight WCET too [101, 85]. Finally, scratchpad allocation techniques that aim to
minimize WCET are presented in [91]. All the techniques are based on static analysis
because the worst case input is unknown and safety is guaranteed by static analysis.
Chapter 4
Cache Modeling via Static Program
Analysis
Trace-driven simulation is widely used for evaluating cache performance of a given
application on a specific cache architecture. However, simulation-based techniques tend
to be slow. In this chapter, we explore an analytical approach as an alternative for exploring
cache architectures. Our analytical approach is shown to be fast and accurate.
4.1 Introduction
The instruction cache plays a critical role in embedded systems in terms of both performance
and energy. The instruction cache hit/miss rate is one of the key factors that affect
system performance and energy consumption. To evaluate cache performance, we can
rely on either simulation or analytical approaches.
Trace-driven simulation is widely used for evaluating cache performance [95]. The
collected application trace is fed to the cache simulator that mimics the behavior of
some hypothetical cache configurations and outputs the cache performance metrics such
as cache hit/miss rate. Unfortunately, simulation-based approaches are too slow and
huge trace sizes put a practical limit on both the size of the application and its input.
Analytical approaches have been proposed as an alternative to simulation for exploration of
memory systems. An analytical approach can be used during the early design stage when the
simulation and execution environment are not available. Furthermore, an analytical approach is
expected to be very fast by its nature. In this work, we consider an analytical approach for
evaluating cache performance. We first introduce the probabilistic cache state, which captures
all the possible cache states at a program location associated with their probabilities.
We also define the corresponding operations for probabilistic cache states. Then, we
present a static program analysis technique to accurately model the cache behavior of
a specific cache configuration. Our analysis only needs basic block/control flow edge
execution count profile information of the application that is much smaller than the
address trace needed by trace-driven simulation and other analytical techniques.
4.2 Analysis Framework
The inputs to our analysis framework are the executable program code and its corre-
sponding input. We can obtain the basic block and control flow edge counts through
execution or quick functional simulation of an instrumented version of the program.
Figure 4.1: Annotated control flow graph. Each basic block is annotated with its execution
count. Each edge is associated with its execution count and frequency (probability). For
example, the execution count of basic block B2 is 40 and the execution count of edge B2 → B4 is
40 too. The edge (B2 → B4) probability is 0.4.
The instrumentation can be done very efficiently by using edge profiling [16]. More
importantly, the profiling needs to be done only once, as basic block and edge execu-
tion counts remain unchanged across different cache configurations.
Our analysis first constructs the loop-procedure hierarchy graph (LPHG) corre-
sponding to the whole program [58]. The LPHG represents the procedure calls and
loop nest relations in the program. Loop and procedure bodies are represented as di-
rected acyclic graphs (DAG), where the nodes of a DAG are the basic blocks. If a loop
(procedure) contains other loops within its body, then the inner loops are represented
as dummy nodes in the DAG. Each loop L is annotated with its loop count NL,
and its control flow graph is transformed such that every loop has a loop pre-header,
post-loop, start, and end node.
Given a basic block $B$ and an edge $B' \to B$, we use $N_B$ and $N_{B' \to B}$ to denote
their execution counts, respectively. For a control flow edge $B' \to B$, the edge
frequency $f(B' \to B)$ is defined as the probability that $B$ is reached from $B'$, that is,
\[
f(B' \to B) = \frac{N_{B' \to B}}{N_B},
\]
so that $\sum_{e \in In(B)} f(e) = 1$, where $In(B)$ represents
all the incoming edges of $B$. Figure 4.1 shows an example of an annotated control flow
graph in which all the basic blocks and control flow edges are annotated with their execution
information.
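As a small worked example, edge frequencies can be derived from profile counts as sketched below. The sketch assumes that the frequency of an edge is its execution count divided by the execution count of its target block, which agrees with the B2 → B4 example of Figure 4.1 (40/100 = 0.4). The function name `edge_frequencies` is hypothetical, and the counts for B3 in the usage note are made up for illustration.

```python
def edge_frequencies(block_counts, edge_counts):
    """Map each edge (src, dst) to its count divided by the count of dst.

    block_counts: {block: execution count}
    edge_counts:  {(src, dst): execution count}
    Edges into blocks that were never executed are skipped.
    """
    freq = {}
    for (src, dst), count in edge_counts.items():
        if block_counts[dst] > 0:
            freq[(src, dst)] = count / block_counts[dst]
    return freq
```

With block counts B2 = 40, B3 = 60, B4 = 100 and edge counts B2 → B4 = 40, B3 → B4 = 60, the frequencies of the two incoming edges of B4 are 0.4 and 0.6, which sum to 1.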
Cache Hit Rate. Let us use B to represent the set of the basic blocks of the program
and Rhit to represent the cache hit rate of the program. Let IB be the number of in-









where Hm is the cache hit rate of the mth memory block access in MB. NB and IB
are constants across different cache configurations and are available through profiling.
However, Hm is unknown and may change across different cache configurations. In the
following, we will illustrate how to estimate Hm through our cache modeling.
1 The edge (loop pre-header to loop start) frequency is 1.
4.3 Cache Modeling
In this section, we describe our cache modeling approach for a specific configuration.
Assumptions. Without loss of generality, we will limit our discussion to a fully
associative cache. A set-associative cache with associativity A can be easily modeled
by modeling each cache set as a fully associative cache containing A blocks. Let $M_i$
denote the set of all the memory blocks that can map to the $i$th cache set. Since every
memory block maps to exactly one cache set, the sets $M_i$ are pairwise disjoint; in
particular, $\bigcap_{i=0}^{K-1} M_i = \emptyset$. Thus, there is no interference among
the cache sets and they can be modeled independently.
More concretely, in the following, we consider a fully-associative cache with A
cache blocks and the program store as a set of memory blocks M . To indicate the
absence of any memory block in a cache line, we introduce a new element ⊥. We
consider LRU (least recently used) replacement policy, where the block replaced is the
one that has been unused for the longest time.
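The reduction from a set-associative cache to independent fully associative caches can be sketched as follows. This is an illustrative decomposition over a trace of memory block numbers, not the static analysis developed in this chapter (which does not need a trace); the function names are hypothetical.

```python
def split_by_set(block_trace, num_sets):
    """Partition a trace of memory block numbers by cache set (m modulo K)."""
    per_set = {i: [] for i in range(num_sets)}
    for m in block_trace:
        per_set[m % num_sets].append(m)
    return per_set


def lru_hits(trace, assoc):
    """Hits of a fully associative LRU cache with `assoc` blocks."""
    state = []                 # most recently used block first
    hits = 0
    for m in trace:
        if m in state:
            hits += 1
            state.remove(m)
        state.insert(0, m)
        del state[assoc:]      # drop the least recently used block if over capacity
    return hits


def set_associative_hits(block_trace, num_sets, assoc):
    """Model each set as an independent fully associative cache of A blocks."""
    per_set = split_by_set(block_trace, num_sets)
    return sum(lru_hits(t, assoc) for t in per_set.values())
```

Because the per-set traces share no memory blocks, the order in which the sets are processed does not matter, and setting `num_sets` to 1 gives the fully associative case directly.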
4.3.1 Concrete Cache States
Let us first formally define the concrete cache states and their corresponding operations.
These definitions will be used later to introduce the notion of probabilistic cache states.
Definition 1 (Concrete Cache States). A concrete cache state c is a vector 〈c[1], . . . , c[A]〉
of length A where c[j] ∈ M ∪ {⊥}. If c[j] = m, then m is the jth most recently used
memory block in the cache. Ω denotes the set of all possible concrete cache states.
We also define a special concrete cache state c⊥ = 〈⊥, . . . ,⊥〉 called the empty cache
state. Figure 4.2 shows some of the concrete cache states corresponding to the control
flow graph.




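Definition 2 (Cache Hit). Given a concrete cache state $c \in \Omega$ and a memory
block $m \in M$,
\[
hit(c, m) =
\begin{cases}
1 & \text{if } \exists j\ (1 \le j \le A) \text{ s.t. } c[j] = m \\
0 & \text{otherwise}
\end{cases}
\]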
Definition 3 (Concrete Cache State Update). We define $\triangleleft$ as the concrete cache state update operator. Given a concrete cache state c ∈ Ω and a memory block m ∈ M ∪ {⊥}, $c \triangleleft m$ denotes the cache state after the memory access m under the LRU policy.
\[
c \triangleleft m =
\begin{cases}
c & \text{if } m = \bot \\
c' \text{ where } c'[1] = m;\ c'[j] = c[j-1],\ 1 < j \le k;\ c'[j] = c[j],\ k < j \le A & \text{if } \exists k \text{ s.t. } c[k] = m \\
c' \text{ where } c'[1] = m;\ c'[j] = c[j-1],\ 1 < j \le A & \text{otherwise}
\end{cases}
\]
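The three cases of the concrete update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a concrete cache state is represented as a tuple whose index 0 holds the most recently used block, `None` stands for ⊥, and the function name `lru_update` is hypothetical rather than taken from the thesis.

```python
from typing import Optional, Tuple

EMPTY = None  # assumption: None plays the role of the "no block" element ⊥
State = Tuple[Optional[str], ...]


def lru_update(c: State, m: Optional[str]) -> State:
    """Concrete LRU update of state c by an access to memory block m."""
    if m is EMPTY:
        # Case 1: no access, the state is unchanged.
        return c
    if m in c:
        # Case 2 (hit at position k): m moves to the front, the blocks that
        # were more recent than m age by one, older blocks keep their place.
        k = c.index(m)
        return (m,) + c[:k] + c[k + 1:]
    # Case 3 (miss): m enters at the front, every block ages by one and the
    # least recently used block (last position) is evicted.
    return (m,) + c[:-1]


empty_state: State = (EMPTY,) * 4
s = empty_state
for block in ["m0", "m1", "m4"]:
    s = lru_update(s, block)
```

After the three accesses, `s` holds m4 in the most recently used position, followed by m1 and m0, with one line still empty.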
4.3.2 Probabilistic Cache States
At any program point, the concrete cache state is dependent on the program path taken
before reaching this program point. In general, a program point can be reached through
multiple program paths leading to a number of possible cache states at that point. We
have to model the probability of each of these cache states. For this purpose, we intro-
duce the notion of probabilistic cache states.
[Figure 4.2 graphic: control flow graph with edge frequencies f(B2 → B4) = 0.5 and f(B3 → B4) = 0.5; basic block B4 accesses m4. Probabilistic cache state update after B4: $\mathcal{C}^{out}_{B4}$ = { 〈m4, m1, m0, ⊥〉 with pr = 0.5, 〈m4, m3, m2, m0〉 with pr = 0.5 }.]
Figure 4.2: Control flow graph consists of two paths with equal probability (0.5). The illus-
tration is for a fully-associative cache with 4 blocks starting with empty cache state. m0–m4
are the memory blocks. Two probabilistic cache states before B4 are shown. The probabilistic
cache states merging and update operation are shown for B4.
Definition 4 (Probabilistic Cache States). A probabilistic cache state $\mathcal{C}$ is a 2-tuple $\langle C, X \rangle$, where $C \in 2^{\Omega}$ is a set of concrete cache states and $X$ is a random variable. The sample space of the random variable $X$ is the set of all possible concrete cache states Ω. Given a concrete cache state c, we define Pr[X = c] as the probability of the cache state c in $\mathcal{C}$. If c ∉ C, then Pr[X = c] = 0. By definition, $\sum_{c \in \Omega} \Pr[X = c] = 1$. Finally, we define a special probabilistic cache state $\mathcal{C}_\bot$ denoting the empty cache state, that is, $\mathcal{C}_\bot = \langle \{c_\bot\}, X \rangle$ where $\Pr[X = c_\bot] = 1$.
Definition 5 (Cache Hit/Miss Probability). Given a probabilistic cache state $\mathcal{C} = \langle C, X \rangle$ and a memory block m ∈ M, the cache hit probability of m is
\[
P_{Hit}(\mathcal{C}, m) = \sum_{c \in C,\ \exists j:\ c[j] = m} \Pr[X = c]
\]
In other words, we add up the probabilities of all the concrete cache states c ∈ C that contain the memory block m. The cache miss probability can now be defined as
\[
P_{Miss}(\mathcal{C}, m) = 1 - P_{Hit}(\mathcal{C}, m)
\]
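Definition 6 (Probabilistic Cache State Update). We define $\trianglelefteq$ as the probabilistic cache state update operator. Given a probabilistic cache state $\mathcal{C} = \langle C, X \rangle$ and an access to memory block m ∈ M, $\mathcal{C} \trianglelefteq m$ denotes the updated probabilistic cache state.
\[
\mathcal{C} \trianglelefteq m = \mathcal{C}' \text{ where } \mathcal{C}' = \langle C', X' \rangle
\]
\[
C' = \{ c \triangleleft m \mid c \in C \}
\]
\[
\Pr[X' = c'] = \sum_{c \in C,\ c \triangleleft m = c'} \Pr[X = c]
\]
Here $\triangleleft$ is the concrete cache state update operator of Definition 3: every concrete state is updated individually, and the probabilities of concrete states that become identical after the update are added up.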
For example, in Figure 4.2, the probabilistic cache state at the end of basic block B4 (starting with the empty cache state) consists of two concrete cache states, each with probability 0.5. The cache miss probability of each of the memory blocks m1–m3 in this probabilistic cache state is 0.5, whereas the miss probability of m0 and m4 is 0.
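The Figure 4.2 example can be replayed with a small sketch. It assumes a probabilistic cache state is stored as a dictionary from concrete states (tuples, most recently used first, `None` for ⊥) to probabilities; the helper names and the two concrete states before B4 are illustrative assumptions chosen to match the miss probabilities quoted above, not data taken from the thesis.

```python
from collections import defaultdict

EMPTY = None  # assumption: None stands for ⊥


def lru_update(c, m):
    """Concrete LRU update (index 0 = most recently used)."""
    if m is EMPTY:
        return c
    if m in c:
        k = c.index(m)
        return (m,) + c[:k] + c[k + 1:]
    return (m,) + c[:-1]


def p_update(pcs, m):
    """Probabilistic update: update each concrete state and add up the
    probabilities of states that become identical."""
    out = defaultdict(float)
    for c, pr in pcs.items():
        out[lru_update(c, m)] += pr
    return dict(out)


def p_hit(pcs, m):
    """Sum of the probabilities of the concrete states that contain m."""
    return sum(pr for c, pr in pcs.items() if m in c)


def p_miss(pcs, m):
    return 1.0 - p_hit(pcs, m)


# Assumed probabilistic cache state at the entry of B4 (two equally likely paths).
before_b4 = {
    ("m1", "m0", EMPTY, EMPTY): 0.5,
    ("m3", "m2", "m0", EMPTY): 0.5,
}
after_b4 = p_update(before_b4, "m4")
```

With these inputs, `p_miss(after_b4, "m1")` evaluates to 0.5, while `p_miss(after_b4, "m0")` and `p_miss(after_b4, "m4")` evaluate to 0, in line with the example in the text.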
4.4 Static Cache Analysis
In this section, we first describe cache analysis for a loop in isolation, i.e., we assume
an empty cache state at the loop entry point. Subsequently, we will extend this analysis
to the whole program. In the following, we consider the control flow graph (CFG) to
be a directed acyclic graph (DAG), representing the body of the loop. We first perform
the analysis on the DAG to model cache behavior for a single iteration of a loop. This
will be followed by probabilistic cache state modeling across iterations.
4.4.1 Analysis of DAG
Let $\mathcal{C}^{in}_B$ and $\mathcal{C}^{out}_B$ be the incoming and outgoing probabilistic cache states of a basic block B. Similarly, $\mathcal{C}^{in}_L$ and $\mathcal{C}^{out}_L$ denote the incoming and outgoing probabilistic cache states of a loop L. Let start and end be the unique start and end basic blocks of the DAG corresponding to the loop body. Then $\mathcal{C}^{in}_L = \mathcal{C}^{in}_{start}$ and $\mathcal{C}^{out}_L = \mathcal{C}^{out}_{end}$. As we are analyzing the loop in isolation at this point, $\mathcal{C}^{in}_L = \mathcal{C}_\bot$. We relax this constraint in the next section.
Let $gen_B = \langle m_1, \ldots, m_k \rangle$ be the sequence of memory blocks accessed within a basic block B. Then
\[
\mathcal{C}^{out}_B = \mathcal{C}^{in}_B \trianglelefteq m_1 \trianglelefteq \cdots \trianglelefteq m_k \tag{4.2}
\]
where $\trianglelefteq$ is the probabilistic cache state update operator of Definition 6. That is, the outgoing probabilistic cache state of a basic block can be derived by repeatedly updating the incoming probabilistic cache state with the memory accesses in B. In order to generate the incoming cache state of B from the cache states of its predecessors, we need to define the following new operator.
Definition 7 (Probabilistic Cache States Merging). We define $\bigoplus$ as the merging operator for probabilistic cache states. It takes as input n probabilistic cache states $\mathcal{C}_i = \langle C_i, X_i \rangle$ and a corresponding weight function w such that $\sum_{i=1}^{n} w(\mathcal{C}_i) = 1$. It produces a merged probabilistic cache state $\mathcal{C}$ as follows.
\[
\bigoplus(\mathcal{C}_1, \ldots, \mathcal{C}_n, w) = \mathcal{C} \text{ where } \mathcal{C} = \langle C, X \rangle
\]
\[
C = \bigcup_{i=1}^{n} C_i
\]
\[
\Pr[X = c] = \sum_{\forall i,\ c \in C_i} \Pr[X_i = c] \times w(\mathcal{C}_i) \quad \text{for each } c \in C
\]
In other words, the set of concrete cache states in $\mathcal{C}$ is the union of all the concrete cache states in $C_1, \ldots, C_n$. The probability of a concrete cache state c ∈ C is the weighted sum of the probabilities of c in the input probabilistic cache states.
Let in(B) denote the set of predecessor basic blocks of B. Then, we can derive the incoming probabilistic cache state of B by applying the merging operation $\bigoplus$ to the outgoing probabilistic cache states of in(B). We define the weight function w as $w(\mathcal{C}^{out}_{B'}) = f(B' \to B)$, where B′ ∈ in(B) is a predecessor of block B. Then, given in(B) = {B′, B″, . . .},
\[
\mathcal{C}^{in}_B = \bigoplus(\mathcal{C}^{out}_{B'}, \mathcal{C}^{out}_{B''}, \ldots, w) \tag{4.3}
\]
Figure 4.2 shows the merging operator at the input of B4. There are two probabilistic cache states $\mathcal{C}^{out}_{B2}$ and $\mathcal{C}^{out}_{B3}$ at the entry of B4. As the two incoming edges of B4 have equal probability, the resulting probabilistic cache state at the entry of B4 contains the concrete cache states of $\mathcal{C}^{out}_{B2}$ and $\mathcal{C}^{out}_{B3}$ with equal weight. The output probabilistic cache state $\mathcal{C}^{out}_{B4}$ is obtained by updating the input probabilistic cache state $\mathcal{C}^{in}_{B4}$ with memory block m4 inside B4.
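The merging step at the entry of B4 can be sketched as a weighted sum over dictionaries. The dictionary representation, the function name `merge`, and the concrete contents of the two predecessor states are assumptions made for illustration; only the edge weights of 0.5 come from the Figure 4.2 description.

```python
from collections import defaultdict

EMPTY = None  # assumption: None stands for ⊥


def merge(states, weights):
    """Weighted merge of probabilistic cache states.

    states  -- list of dicts mapping concrete state (tuple) -> probability
    weights -- one weight per state, e.g. the incoming edge frequencies;
               they are expected to sum to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    merged = defaultdict(float)
    for pcs, w in zip(states, weights):
        for c, pr in pcs.items():
            # The same concrete state may come from several predecessors.
            merged[c] += w * pr
    return dict(merged)


# Assumed outgoing states of the two predecessors of B4.
out_b2 = {("m1", "m0", EMPTY, EMPTY): 1.0}
out_b3 = {("m3", "m2", "m0", EMPTY): 1.0}
in_b4 = merge([out_b2, out_b3], [0.5, 0.5])
```

Here `in_b4` contains both concrete states with probability 0.5 each; if the two predecessors shared a concrete state, its weighted probabilities would simply add up.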
Hit Rate Computation. Recall that $gen_B = \langle m_1, \ldots, m_k \rangle$ is the sequence of memory blocks accessed within a basic block B. Let us define k random variables $Y_1, \ldots, Y_k$ corresponding to the memory blocks $m_1, \ldots, m_k$ in $gen_B$, where $Y_i$ denotes the cache hit/miss event for the access to memory block $m_i$. $Y_i$ can be modeled as a random variable with Bernoulli distribution by setting $Y_i = 1$ if the access to $m_i$ is a cache miss and $Y_i = 0$ otherwise.
\[
\Pr[Y_1 = 1] = P_{Miss}(\mathcal{C}^{in}_B, m_1)
\]
\[
\Pr[Y_i = 1] = P_{Miss}(\mathcal{C}^{in}_B \trianglelefteq m_1 \cdots \trianglelefteq m_{i-1},\ m_i), \quad 1 < i \le k
\]
\[
\Pr[Y_i = 0] = 1 - \Pr[Y_i = 1], \quad 1 \le i \le k
\]
By definition of the Bernoulli distribution, the expected miss rate of memory block $m_i$ is $\Pr[Y_i = 1]$, and its hit rate is $\Pr[Y_i = 0] = 1 - \Pr[Y_i = 1]$.
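As a rough sketch of this computation, the miss probability of each access in a basic block can be obtained by walking through its access sequence and updating the probabilistic cache state after every access. The function `miss_probabilities`, the dictionary representation, and the access sequence used below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

EMPTY = None  # assumption: None stands for ⊥


def lru_update(c, m):
    """Concrete LRU update (index 0 = most recently used)."""
    if m is EMPTY:
        return c
    if m in c:
        k = c.index(m)
        return (m,) + c[:k] + c[k + 1:]
    return (m,) + c[:-1]


def miss_probabilities(pcs_in, gen):
    """Return ([Pr[Y_1 = 1], ..., Pr[Y_k = 1]], outgoing state) for the
    access sequence gen, starting from the incoming state pcs_in."""
    pcs = dict(pcs_in)
    probs = []
    for m in gen:
        hit = sum(pr for c, pr in pcs.items() if m in c)
        probs.append(1.0 - hit)  # miss probability seen by this access
        nxt = defaultdict(float)
        for c, pr in pcs.items():
            nxt[lru_update(c, m)] += pr
        pcs = dict(nxt)
    return probs, pcs


# Assumed incoming state and access sequence of a basic block.
pcs_in = {("m1", "m0", EMPTY, EMPTY): 0.5, ("m3", "m2", "m0", EMPTY): 0.5}
probs, pcs_out = miss_probabilities(pcs_in, ["m4", "m0", "m1"])
hit_rates = [1.0 - p for p in probs]
```

In this example the first access always misses, the second always hits, and the third misses with probability 0.5, because m1 is present in only one of the two concrete states.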
4.4.2 Analysis of Loop
In the previous section, we derived the incoming and outgoing probabilistic cache states of each basic block for a single iteration of the loop body, starting with the empty cache state $\mathcal{C}^{in}_L = \mathcal{C}_\bot$. However, for a loop iterating multiple times, the input cache state at the start node of the loop body is different for each iteration. More concretely, let us add the subscript $\langle n \rangle$ for the $n$-th iteration of the loop. Then $\mathcal{C}^{in}_{start\langle n \rangle} = \mathcal{C}^{out}_{end\langle n-1 \rangle}$ for n > 1. However, in order to compute $\mathcal{C}^{in}_{start\langle 1 \rangle}, \ldots, \mathcal{C}^{in}_{start\langle N \rangle}$, where N is the loop bound, we do not need to traverse the DAG N times. Instead, we introduce two new operators.
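Definition 8 (Concatenation of Concrete Cache States). Given two concrete cache states $c_1, c_2$,
\[
c_1 \odot c_2 = c \text{ where } c = c_1 \triangleleft c_2[A] \triangleleft \cdots \triangleleft c_2[1]
\]
That is, the blocks of $c_2$ are applied to $c_1$ with the concrete update operator $\triangleleft$ (Definition 3), from the least recently used block to the most recently used block.
Definition 9 (Concatenation of Probabilistic Cache States). Given probabilistic cache states $\mathcal{C}_1 = \langle C_1, X_1 \rangle$ and $\mathcal{C}_2 = \langle C_2, X_2 \rangle$,
\[
\mathcal{C}_1 \bigodot \mathcal{C}_2 = \mathcal{C} \text{ where } \mathcal{C} = \langle C, X \rangle
\]
\[
C = \{ c \mid c = c_1 \odot c_2,\ c_1 \in C_1,\ c_2 \in C_2 \}
\]
\[
\Pr[X = c] = \sum_{c_1 \in C_1,\ c_2 \in C_2,\ c = c_1 \odot c_2} \Pr[X_1 = c_1] \times \Pr[X_2 = c_2]
\]
Consider two program fragments, each analyzed starting with an empty cache state, whose resulting probabilistic cache states are $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. Then the probabilistic cache state after executing the first fragment followed by the second is $\mathcal{C}_1 \bigodot \mathcal{C}_2$.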
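Both concatenation operators can be sketched directly from Definitions 8 and 9. The representation (tuples with the most recently used block first, `None` for ⊥, dictionaries for probabilistic states) and the function names are assumptions for illustration only.

```python
from collections import defaultdict

EMPTY = None  # assumption: None stands for ⊥


def lru_update(c, m):
    """Concrete LRU update (index 0 = most recently used)."""
    if m is EMPTY:
        return c
    if m in c:
        k = c.index(m)
        return (m,) + c[:k] + c[k + 1:]
    return (m,) + c[:-1]


def concat_concrete(c1, c2):
    """Replay the blocks of c2 on c1, least recently used block first."""
    c = c1
    for m in reversed(c2):
        c = lru_update(c, m)
    return c


def concat(p1, p2):
    """Concatenate probabilistic states: combine every pair of concrete
    states and multiply their probabilities."""
    out = defaultdict(float)
    for c1, pr1 in p1.items():
        for c2, pr2 in p2.items():
            out[concat_concrete(c1, c2)] += pr1 * pr2
    return dict(out)


first = {("a", "b", EMPTY, EMPTY): 1.0}
second = {("c", "a", EMPTY, EMPTY): 0.5, (EMPTY,) * 4: 0.5}
combined = concat(first, second)
```

Note that the pairwise loop makes the cost proportional to the product of the numbers of concrete states in the two inputs, which is why large probabilistic cache states make concatenation expensive.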
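Now we can compute the outgoing probabilistic cache state of a loop L for each iteration by applying the $\bigodot$ operator. First, we note that $\mathcal{C}^{in}_{start\langle 1 \rangle} = \mathcal{C}^{in}_L = \mathcal{C}_\bot$, so $\mathcal{C}^{out}_{end\langle 1 \rangle}$ is the outgoing state obtained from the DAG analysis of Section 4.4.1. Then for n > 1,
\[
\mathcal{C}^{out}_{end\langle n \rangle} = \mathcal{C}^{out}_{end\langle n-1 \rangle} \bigodot \mathcal{C}^{out}_{end\langle 1 \rangle} \tag{4.4}
\]
The final probabilistic cache state after N iterations, starting with the empty cache state $\mathcal{C}^{in}_L = \mathcal{C}_\bot$, is denoted as $\mathcal{C}^{gen}_L$, where
\[
\mathcal{C}^{gen}_L = \mathcal{C}^{out}_{end\langle N \rangle} \tag{4.5}
\]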
The hit/miss probability of the memory blocks in a basic block B depends on the input probabilistic cache state $\mathcal{C}^{in}_B$ of B, which in turn depends on $\mathcal{C}^{in}_{start\langle n \rangle}$ of the loop L. Computing these probabilities for each memory block in each iteration is computationally expensive and is equivalent to complete loop unrolling. Instead, we observe that we only need to compute an “average” probabilistic cache state $\mathcal{C}^{avg}_L$ at the start node of the loop body. This captures the input cache state of the loop over N iterations.
$\mathcal{C}^{avg}_L$ can be defined as
\[
\mathcal{C}^{avg}_L = \bigoplus(\mathcal{C}^{in}_{start\langle 1 \rangle}, \ldots, \mathcal{C}^{in}_{start\langle N \rangle}, w) \tag{4.6}
\]
where $w(\mathcal{C}^{in}_{start\langle n \rangle}) = \frac{1}{N}$. Now, in Section 4.4.1, we simply replace $\mathcal{C}^{in}_{start} = \mathcal{C}_\bot$ with $\mathcal{C}^{in}_{start} = \mathcal{C}^{avg}_L$. The rest of the analysis for the DAG remains unchanged.
More importantly, for any cache configuration, the operator $\bigodot$ need not be invoked N times in practice. The probabilistic cache states converge very quickly for most loops (see Section 4.6.1). After the convergence point, neither the content of the probabilistic cache state nor the associated probabilities change.
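The iteration with early termination can be sketched as follows. The sketch assumes that the outgoing state of iteration n is the concatenation of the outgoing state of iteration n − 1 with the single-iteration result obtained from an empty cache, which is how the sequential-execution property of the concatenation operator is read here; the function names, the convergence tolerance, and the example inputs are all hypothetical.

```python
from collections import defaultdict

EMPTY = None  # assumption: None stands for ⊥


def lru_update(c, m):
    if m is EMPTY:
        return c
    if m in c:
        k = c.index(m)
        return (m,) + c[:k] + c[k + 1:]
    return (m,) + c[:-1]


def concat(p1, p2):
    out = defaultdict(float)
    for c1, pr1 in p1.items():
        for c2, pr2 in p2.items():
            c = c1
            for m in reversed(c2):  # replay c2 from LRU to MRU
                c = lru_update(c, m)
            out[c] += pr1 * pr2
    return dict(out)


def same(a, b, tol=1e-12):
    return a.keys() == b.keys() and all(abs(a[c] - b[c]) <= tol for c in a)


def loop_states(body_out, n_iter, assoc):
    """Return (average input state, final state) of a loop with n_iter >= 1
    iterations; body_out is the one-iteration result from an empty cache."""
    cur_in = {(EMPTY,) * assoc: 1.0}  # input state of iteration 1
    avg = defaultdict(float)
    for n in range(1, n_iter + 1):
        for c, pr in cur_in.items():
            avg[c] += pr / n_iter  # weight 1/N per iteration
        cur_out = concat(cur_in, body_out)
        if same(cur_out, cur_in):
            # Converged: the remaining n_iter - n iterations see the same input.
            for c, pr in cur_in.items():
                avg[c] += pr * (n_iter - n) / n_iter
            return dict(avg), cur_out
        cur_in = cur_out
    return dict(avg), cur_in


avg, gen = loop_states({("a", "b"): 1.0}, 100, 2)
```

In this example the state stops changing in the second iteration, so only two concatenations are performed although the loop bound is 100.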
4.4.3 Special case for Direct Mapped Cache
The computation of $\mathcal{C}^{avg}_L$ and $\mathcal{C}^{gen}_L$, as discussed earlier, is quite general and works for fully associative, set-associative, as well as direct mapped caches. However, for direct mapped caches (where A = 1), the computation of the average probabilistic cache state is much simpler. As mentioned earlier, in a direct mapped cache with K cache sets, each cache set is treated independently. Let $M_i$ denote the set of all the memory blocks that can map to the $i$-th cache set. Then, a concrete cache state c corresponding to the $i$-th cache set is a vector 〈c[1]〉 of length 1, where $c[1] \in M_i \cup \{\bot\}$.
As before, we assume $\mathcal{C}^{in}_L = \mathcal{C}_\bot$ and let $\mathcal{C}^{out}_{end\langle n \rangle} = \langle C_{\langle n \rangle}, X_{\langle n \rangle} \rangle$ for 1 ≤ n ≤ N be the outgoing cache state after the $n$-th iteration. It is easy to see that if the associativity A = 1, then given any two iterations n, n′ where n ≠ n′, the set of concrete cache states remains unchanged, i.e., $C_{\langle n \rangle} = C_{\langle n' \rangle}$. Moreover, if $c_\bot \notin C_{\langle n \rangle}$, then the probability distribution function remains unchanged across iterations as well. That is, $\Pr[X_{\langle n \rangle} = c] = \Pr[X_{\langle n' \rangle} = c]$ for any concrete cache state c. If $c_\bot \in C_{\langle n \rangle}$, then the probability distribution function changes across iterations. Let us assume that the concrete cache state $c \in C_{\langle n \rangle}$ contains the memory block $m \in M_i$, that is, c[1] = m. Also, let p be the probability of this cache state after the first iteration, i.e., $\Pr[X_{\langle 1 \rangle} = c] = p$. Similarly, let q be the probability of the empty cache state after the first iteration, i.e., $\Pr[X_{\langle 1 \rangle} = c_\bot] = q$. Now what is the probability of cache state c after n iterations? It is the sum of (1) the probability p that the memory block m is loaded into the cache set during the $n$-th iteration, and (2) the probability that the cache state is c after n − 1 iterations and no memory block mapping to this cache set is accessed in the $n$-th iteration. Thus
\[
\Pr[X_{\langle n \rangle} = c] = p + q \times \Pr[X_{\langle n-1 \rangle} = c] \tag{4.7}
\]
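By solving this recursion with the base case $\Pr[X_{\langle 1 \rangle} = c] = p$,
\[
\Pr[X_{\langle N \rangle} = c] = p \cdot \frac{1 - q^N}{1 - q}
\]
\[
\sum_{n=1}^{N-1} \Pr[X_{\langle n \rangle} = c] = p \cdot \frac{(N-1)(1-q) - q(1 - q^{N-1})}{(1-q)^2}
\]
For $c_\bot$, the cache set is still empty after n iterations only if no memory block is accessed in any of these n iterations; therefore the following equations hold:
\[
\Pr[X_{\langle N \rangle} = c_\bot] = q^N
\]
\[
\sum_{n=1}^{N-1} \Pr[X_{\langle n \rangle} = c_\bot] = \frac{q \cdot (1 - q^{N-1})}{1 - q}
\]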
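$\mathcal{C}^{avg}_L$ is defined as the average input probabilistic cache state at loop entry. Thus, based on Equations 4.6 and 4.4, the average probabilistic cache state at loop entry $\mathcal{C}^{avg}_L = \langle C, X \rangle$ can be computed as follows with $\mathcal{C}^{in}_L = \mathcal{C}_\bot$:
\[
C = C_{\langle 1 \rangle}
\]
\[
\Pr[X = c] = \frac{1}{N} \left( \sum_{n=1}^{N-1} \Pr[X_{\langle n \rangle} = c] \right) = p \cdot \frac{(N-1)(1-q) - q(1 - q^{N-1})}{N \cdot (1-q)^2}
\]
\[
\Pr[X = c_\bot] = \frac{1}{N} \left( 1 + \sum_{n=1}^{N-1} \Pr[X_{\langle n \rangle} = c_\bot] \right) = \frac{1 - q^N}{N \cdot (1-q)}
\]
The final probabilistic cache state after N iterations starting with the empty cache state $\mathcal{C}^{in}_L = \mathcal{C}_\bot$, $\mathcal{C}^{gen}_L$, is
\[
C = C_{\langle 1 \rangle}
\]
\[
\Pr[X = c] = \Pr[X_{\langle N \rangle} = c] = p \cdot \frac{1 - q^N}{1 - q}
\]
\[
\Pr[X = c_\bot] = \Pr[X_{\langle N \rangle} = c_\bot] = q^N
\]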
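The closed forms can be cross-checked numerically against the recursion of Equation 4.7. The sketch below assumes q < 1 and reads the average as the mean over the N iteration inputs, where the input of the first iteration is the empty cache; the function names and the values of p, q, and N are arbitrary illustrations.

```python
import math


def by_recursion(p, q, n_iter):
    """Iterate Pr<n>[c] = p + q * Pr<n-1>[c] and Pr<n>[empty] = q * Pr<n-1>[empty]."""
    pr_c, pr_empty = 0.0, 1.0  # input of iteration 1: empty cache set
    sum_c = sum_empty = 0.0
    for _ in range(n_iter):
        sum_c += pr_c          # accumulate the input state of this iteration
        sum_empty += pr_empty
        pr_c = p + q * pr_c    # state after this iteration
        pr_empty = q * pr_empty
    return pr_c, pr_empty, sum_c / n_iter, sum_empty / n_iter


def closed_form(p, q, n_iter):
    """Closed forms for the final and the average state (requires q < 1)."""
    n = n_iter
    gen_c = p * (1 - q ** n) / (1 - q)
    gen_empty = q ** n
    avg_c = p * ((n - 1) * (1 - q) - q * (1 - q ** (n - 1))) / (n * (1 - q) ** 2)
    avg_empty = (1 - q ** n) / (n * (1 - q))
    return gen_c, gen_empty, avg_c, avg_empty


# Hypothetical example: block m is left in the set with probability 0.3 per
# iteration and the set is untouched with probability 0.2.
rec = by_recursion(0.3, 0.2, 10)
cf = closed_form(0.3, 0.2, 10)
agree = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(rec, cf))
```

The closed forms cost a constant number of arithmetic operations per cache set, independent of the loop bound, whereas the recursion is linear in N.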
4.4.4 Analysis of Whole Program
So far we have assumed that the execution of a loop starts with an empty cache state. In this section, we show how to compute the probabilistic cache state in the context of the whole program. Recall that $\mathcal{C}^{gen}_L$ represents the final cache state of loop L after N iterations starting with an empty cache state. Also, we use $\mathcal{C}^{avg}_L$ to denote the average probabilistic cache state at loop entry across N iterations, again assuming that the loop L is executing in isolation. If L executes in isolation, then the average probabilistic cache state of a basic block in loop L can be computed by starting with the cache state $\mathcal{C}^{avg}_L$. Now, if $\mathcal{C}^{in}_L$ is the initial cache state for loop L in the context of the whole program, then the average probabilistic cache state of a basic block in loop L is computed by simply starting with the cache state $\mathcal{C}^{in}_L \bigodot \mathcal{C}^{avg}_L$. The analysis of the whole program then requires computing the initial probabilistic cache states for all the loops and procedures in the program.

[Figure 4.3 graphic; panels: (a) Inner loop analysis, (b) Outer loop analysis, (c) Whole program analysis]
Figure 4.3: Analysis of whole program.
In order to compute the initial cache states, we construct the loop-procedure hierarchy graph (LPHG) for the whole program. The LPHG represents the procedure call and loop nest relations in the application. We first traverse the LPHG in a bottom-up fashion, i.e., we start with the innermost loops/procedures and compute $\mathcal{C}^{gen}_L$ and $\mathcal{C}^{avg}_L$ for all such loops/procedures, as shown in Figure 4.3(a). Next, we replace the innermost loops/procedures with “dummy” nodes in the DAG of the enclosing loop/procedure. While traversing the DAG of the enclosing loop/procedure, special care is taken for the dummy nodes. Let $\mathcal{C}^{in}_L$ be the input cache state for dummy node L during the traversal of the DAG. Then we treat the dummy node as a black box and compute the output cache state of the dummy node as $\mathcal{C}^{out}_L = \mathcal{C}^{in}_L \bigodot \mathcal{C}^{gen}_L$, as shown in Figure 4.3(b). At the end of this bottom-up traversal, we reach the root node (main procedure), having computed $\mathcal{C}^{gen}_L$ and $\mathcal{C}^{avg}_L$ for all loops/procedures. Now we perform a top-down traversal to compute the cache state at each basic block in the context of the whole program. Suppose L is a dummy node in main with input cache state $\mathcal{C}^{in}_L$ and start node start. Then we traverse the DAG of L starting with $\mathcal{C}^{in}_{start} = \mathcal{C}^{in}_L \bigodot \mathcal{C}^{avg}_L$, as shown in Figure 4.3(c), and compute the probabilistic cache state at each node of the DAG. This top-down process continues until we reach all the innermost loops. At this point, we have computed the “average” probabilistic cache state for each basic block in the context of the whole program.
4.5 Cache Hierarchy Analysis
Modern embedded systems’ memory hierarchy consists of multi-level caches. In this
section, we extend our techniques for multi-level non-inclusive instruction caches [48],
where the following properties hold:
• A memory access is searched in the cache level L if and only if it is a cache miss
in the cache level L− 1. Cache level 1 is always accessed.
• Every time a cache miss occurs at cache level L, the entire cache line requested
is loaded into the cache of level L.
Furthermore, we also assume that the cache size and the block size of cache level L + 1 are each equal to or greater than those of cache level L [44].
Top-down Cache Hierarchy Analysis. The L1 cache services every memory reference request. However, this is not true for the other cache levels (L > 1), because cache level L + 1 is accessed only when the memory reference is a cache miss at cache level L. Thus, given a memory reference m and its hit rate $H_m[L]$ at cache level L: if $H_m[L] = 0$, then cache level L + 1 is guaranteed to be accessed; if $H_m[L] = 1$, then cache level L + 1 is guaranteed not to be accessed; and if $0 < H_m[L] < 1$, then we need to consider both scenarios.
In Section 4.3.2, we defined $\trianglelefteq$ as the probabilistic cache state update operator. Given a probabilistic cache state $\mathcal{C}$, $\mathcal{C} \trianglelefteq m$ returns the probabilistic cache state after the access of m. We also defined the merge operator $\bigoplus$ in Section 4.4, where $\bigoplus(\mathcal{C}_1, \ldots, \mathcal{C}_n, w)$ returns the merged probabilistic cache state of $\mathcal{C}_1, \ldots, \mathcal{C}_n$ based on the weight function w.
Now, let $\mathcal{C}^{in}_L$ and $\mathcal{C}^{out}_L$ be the probabilistic cache states of cache level L (L ≥ 1) before and after the access of memory block m, respectively. In order to handle multi-level caches, we define a new probabilistic cache state update for cache level L as follows.
\[
\mathcal{C}^{out}_L =
\begin{cases}
\mathcal{C}^{in}_L \trianglelefteq m & \text{if } L = 1 \\
\bigoplus(\mathcal{C}^{in}_L,\ \mathcal{C}^{in}_L \trianglelefteq m,\ w) & \text{otherwise}
\end{cases}
\]
where $w(\mathcal{C}^{in}_L) = H_m[L-1]$ and $w(\mathcal{C}^{in}_L \trianglelefteq m) = 1 - H_m[L-1]$.
In other words, given a memory block reference m, for cache level L + 1 we need to consider two scenarios: it is accessed (m is a cache miss at cache level L) or it is not accessed (m is a cache hit at cache level L). The two output cache states corresponding to the two scenarios are then merged based on their probabilities. Thus, we need to perform a top-down cache hierarchy analysis, as shown in Figure 4.4. We start with the L1 cache analysis. After the L1 analysis, the probabilistic L1 cache state is available at every program point and the hit/miss probability of each memory access can be computed. Then, we proceed to the L2 cache analysis, compute the corresponding cache hit/miss rates, and continue this analysis down to main memory.

[Figure 4.4 graphic; labels: L1 Access, L1 Cache Analysis, L1 Hit Probability, L2 Probabilistic Cache State, L2 Hit Probability]
Figure 4.4: Top-down cache hierarchy analysis
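The level-aware update can be sketched by combining the single-level update with a two-way merge. The dictionary representation, the function names, and the example cache states below are assumptions for illustration; `hit_prev` stands for the hit rate of the reference at the previous cache level.

```python
from collections import defaultdict

EMPTY = None  # assumption: None stands for ⊥


def lru_update(c, m):
    if m is EMPTY:
        return c
    if m in c:
        k = c.index(m)
        return (m,) + c[:k] + c[k + 1:]
    return (m,) + c[:-1]


def p_update(pcs, m):
    out = defaultdict(float)
    for c, pr in pcs.items():
        out[lru_update(c, m)] += pr
    return dict(out)


def p_hit(pcs, m):
    return sum(pr for c, pr in pcs.items() if m in c)


def update_at_level(pcs, m, level, hit_prev=None):
    """Update the state of cache `level` for a reference to m.

    For level 1 the cache is always accessed. For deeper levels the cache is
    accessed only if the previous level misses, so the untouched state
    (weight hit_prev) and the updated state (weight 1 - hit_prev) are merged.
    """
    accessed = p_update(pcs, m)
    if level == 1:
        return accessed
    merged = defaultdict(float)
    for c, pr in pcs.items():
        merged[c] += hit_prev * pr
    for c, pr in accessed.items():
        merged[c] += (1.0 - hit_prev) * pr
    return {c: pr for c, pr in merged.items() if pr > 0.0}


# Hypothetical two-level example: direct mapped L1 set, 2-way L2 set.
l1 = {("m",): 0.5, (EMPTY,): 0.5}
l2 = {(EMPTY, EMPTY): 1.0}
h1 = p_hit(l1, "m")                       # L1 hit rate of m
l1_out = update_at_level(l1, "m", 1)
l2_out = update_at_level(l2, "m", 2, hit_prev=h1)
```

In this example the L2 state is touched only in the half of the cases where the L1 access misses, so m ends up in L2 with probability 0.5.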
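Cache Hits Computation. Recall that we use $\mathcal{B}$ to represent the set of basic blocks of the program, and $M_B$ and $N_B$ to represent the sequence of memory blocks accessed in basic block B and the execution count of B, respectively. We use H[L] to denote the number of cache hits at cache level L. It is obtained by accumulating, over every basic block $B \in \mathcal{B}$ and every memory reference $m \in M_B$, the contribution determined by the execution count $N_B$ and the hit rate $H_m[L]$, where $H_m[L]$ is the cache hit rate of memory reference m at cache level L. $H_m[L]$ is available after the top-down cache hierarchy analysis and $N_B$ can be obtained through profiling. Now, the number of accesses to cache level L is $I - \sum_{1 \le i < L} H[i]$, where I is the number of dynamically executed instructions.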
4.6 Experimental Evaluation
In this section, we evaluate the accuracy and efficiency of our static cache analysis by
comparing it with cache simulator Dinero-IV [29]. Dinero-IV is a widely used trace
driven cache simulator, but it only simulates one cache configuration at a time. Thus, to
explore a design space with multiple cache configurations, Dinero-IV has to be invoked
for each configuration. We will present the results of level-1 cache first, followed by a
two-level cache hierarchy results.
We select 10 programs from MiBench [46]. We fix a line size for each benchmark.
The line size for each benchmark is selected such that the cache hit rate has a wide
coverage. The benchmarks, corresponding line size, and trace size are shown in Table
4.1. For trace-driven simulation, trace size can be quite large even for small programs as
shown in column Trace. On the contrary, the inputs to our analysis are just basic block
and control flow edge execution counts, whose size is so small that it can be ignored.
We use the SimpleScalar toolset [14] for the experiments. We instrument its functional simulator sim-profile to collect the execution counts of basic blocks and control flow edges. The time spent in our instrumentation during the functional simulation is shown in column Prof. In order to generate the address trace, sim-profile is also instrumented to output the execution trace; the time spent in trace collection is shown in column Trace. Our estimator first disassembles the executable to construct the CFG and LPHG, and then proceeds with the cache hit estimation. Similar to simulation, our estimation has to be repeated for each cache configuration to obtain the cache hit rate. For estimation, a special version is implemented for direct mapped caches (A = 1) based on the closed forms presented in Section 4.4.3. We perform all experiments on a 3GHz Pentium 4 CPU with 2GB memory.

Benchmark   Line Size   Trace Size   Time (sec)
            (Byte)      (MB)         Prof     Trace      Dinero-IV   Analysis    Ratio
bitcount    8           3583         17.04    313.998    8897        0.132       67401.52
dijkstra    8           4700         9.05     437.111    8789        1.707       5148.8
adpcmdec    8           791          3.22     76.852     2523        1360.413    1.85
adpcmenc    8           961          4.1      94.008     2487        3623.264    0.69
sha         8           706          0.69     65.137     1818        0.159       11433.96
rijndael    32          1600         0.99     136.072    3770        0.157       24012.74
susans      8           4206         5.7      392.494    8916        1.105       8068.78
susanc      16          896          0.25     83.433     2251        inf         inf
gsmenc      16          2089         2.01     215.035    5357        2.638       2030.71
gsmdec      16          1800         8.29     154.505    4159        4.956       938.18
Table 4.1: Benchmark characteristics and runtime comparison of Dinero-IV and our analysis.
4.6.1 Level-1 Cache
For the level-1 cache, we vary the number of cache sets from 4 to 64 and the associativity from 1 to 8 (both in powers of two). That is, a total of 20 cache configurations are estimated and simulated. The cache hit rates of both simulation and estimation are shown in Figure 4.5. The horizontal axis represents the 20 cache configurations, where a × b denotes a cache with a cache sets and associativity b. For benchmark susanc, estimation fails to return the cache hit rate for a few configurations (A = 8) even after running for 10 hours, as highlighted in Figure 4.5. For the others, the estimation results track the simulation results quite closely. Over all the benchmarks and cache configurations, the average error is 0.07%. The error is defined as |est − sim|, where est and sim are the estimated and simulated cache hit rates, respectively.
In Section 4.4.2, we mentioned that the concatenation operator does not need to be invoked N (loop bound) times, because the probabilistic cache state may converge earlier. Here, we support this claim with experimental results. Figure 4.6 shows the distribution of the number of iterations that the cache sets take to converge. More than 80% of the cache sets converge after the second iteration for all associativity settings (for all loops in all our benchmarks).
Simulation and estimation times are shown in columns Dinero-IV and Analysis of Table 4.1. Ratio is defined as $\frac{\text{Dinero-IV}}{\text{Analysis}}$. As shown, for most benchmarks, our analysis is much faster than simulation. However, there are a few exceptions. For adpcmdec, the estimation is only slightly faster than simulation; for adpcmenc, the estimation is slower than simulation; for susanc, the estimation fails to return results for a few configurations even after running for 10 hours. The slowness of estimation for these benchmarks is mainly due to the highly associative cache configurations (e.g., A = 8). For those highly associative caches, the number of concrete cache states in a probabilistic cache state can be large. For example, for some loops and cache sets, the “average” probabilistic cache states ($\mathcal{C}^{avg}_L$) contain hundreds of concrete cache states for adpcmdec. As a result, the probabilistic cache state concatenation operation can take long to finish, as it involves concatenating all the combinations of concrete cache states across two probabilistic cache states. Furthermore, the probabilistic cache state may not converge quickly given such a complicated composition of concrete cache states. For most of the cache sets, when the associativity is high (A = 8), the concatenation has to be repeated until the loop bound is reached. In the next chapter, we will propose a few optimizations to solve this problem.

Figure 4.5: The estimation vs simulation of cache hit rate across 20 configurations.

Figure 4.6: Cache set convergence for different values of associativity.
Analysis Sensitivity. In the above experiments, the estimation is based on the profiles (basic block and control flow edge execution counts) of one input, and the simulation results are collected using the same input. Here, we evaluate the sensitivity of the analysis to profiles from different inputs. More precisely, we use the profile from one input to estimate the cache hit rate and then compare it with the simulation result of another input. The comparison is presented in Figure 4.7. As shown, the estimation predicts the simulation very well even if the profiles are from different inputs. Overall, across all the benchmarks and cache configurations, the average error is 0.91%. The error is defined as |est − sim|, where est and sim are the estimated and simulated cache hit rates, respectively.

Figure 4.7: The estimation vs simulation of cache hit rate across 20 configurations. Estimation is based on the profiles of an input different from simulation input.
4.6.2 Multi-level Caches
The technique presented in Section 4.5 can be applied to a cache hierarchy with any number of cache levels. In the following, however, we consider a two-level cache hierarchy. As for the cache design parameters, we choose realistic parameters that reflect the typical design of embedded systems. For the L1 cache, we consider cache sizes of 1KB and 2KB, block (line) sizes of 16 and 32 bytes, and direct mapped and 2-way set associative organizations. For the L2 cache, we explore cache sizes of 1KB, 2KB, 4KB and 8KB, block sizes of 16 and 32 bytes, and direct mapped and 2-way set associative organizations. However, the block size and cache size of L2 must be equal to or greater than those of the L1 cache, respectively. Thus, there are 84 valid cache configurations in total (3 admissible block-size pairs × 7 admissible cache-size pairs × 4 associativity combinations). We have shown that our cache modeling is accurate in terms of cache hit rate. In the following, we validate the accuracy of our modeling in terms of both performance and energy consumption.
Performance and Energy Model. As for the cache access latency, we assume a latency of 1 cycle for an L1 access and 4 cycles for an L2 access. The main memory access is considered to be pipelined: the first access to main memory takes 18 cycles, while the subsequent accesses take 2 cycles each. For the different cache configurations, we model the energy consumption of the memory hierarchy using the CACTI [89] model for 0.13µm technology. For the memory energy consumption, our focus is dynamic energy consumption. The energy consumption of one access to main memory is assumed to be 50 times the energy consumption of one access to the L2 cache [106].
Design Space. The entire design space of caches (84 configurations) is explored via both simulation (Dinero-IV) and estimation (our analytical modeling). The performance and energy numbers are computed using the cache hits/misses returned by simulation and estimation, respectively. Each design point represents a cache configuration (L1 and L2 cache parameters) and is associated with two numbers: performance and energy consumption. The entire design spaces for both simulation and estimation are shown in Figure 4.8. As shown, our estimation is close to the simulation results in terms of both the individual points and the entire space.
Pareto-optimal Points. We are only interested in the pareto-optimal points in the design space. Each pareto-optimal point represents a set of cache hierarchy parameters (both L1 and L2). Once these points are selected, either via simulation or estimation, embedded system designers rely on detailed simulation or execution to obtain more accurate performance and energy numbers. We use Wattch [23], a micro-architecture level energy/performance simulator, to measure the accurate performance and energy. However, Wattch does not model the energy consumption of memory, so we extend it to include the energy consumption of the memory component. We consider an in-order issue processor. Our focus is the instruction cache, so we disable the data cache component in the Wattch simulator.

Figure 4.8: Performance-energy design space and pareto-optimal points for both simulation
and estimation.
Now, we have two sets of pareto-optimal points: one from simulation and the other from estimation. To compare these two sets of pareto-optimal points, we rely on the metric in [107]. Let X′, X″ be two sets of pareto-optimal points. Then
\[
C(X', X'') = \frac{|\{a'' \in X'';\ \exists a' \in X' : a' \succeq a''\}|}{|X''|}
\]
where $a' \succeq a''$ means that a′ covers (dominates or equals) a″. C(X′, X″) lies in the interval [0, 1], where C(X′, X″) = 1 means that all solutions in X″ are covered by solutions in X′, and C(X′, X″) = 0 means that none of the solutions in X″ are covered by the set X′. Let sim and est be the two sets of pareto-optimal points for simulation and estimation, respectively. Then, we are interested in C(est, sim). For all the benchmarks, C(est, sim) = 1. In other words, all the exact solutions (configurations from simulation) are covered by solutions (configurations) returned by our estimation.
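The coverage metric is straightforward to sketch once a design point is represented as a pair of numbers. The sketch assumes that both objectives are to be minimized (for example, execution cycles and energy) and that "covers" means no worse in both objectives; the function names and the sample points are hypothetical.

```python
def covers(a, b):
    """a covers b if a is no worse than b in every objective (minimization)."""
    return all(x <= y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that are not covered by any other, different point."""
    return [p for p in points
            if not any(covers(q, p) and q != p for q in points)]


def coverage(x1, x2):
    """Fraction of the points in x2 that are covered by some point in x1."""
    covered = sum(1 for b in x2 if any(covers(a, b) for a in x1))
    return covered / len(x2)


# Hypothetical (cycles, energy) design points.
design_space = [(10, 5), (8, 7), (12, 4), (11, 6)]
front = pareto_front(design_space)
c_forward = coverage([(10, 5), (8, 7)], [(10, 5), (9, 8)])
c_backward = coverage([(9, 8)], [(10, 5), (8, 7)])
```

The metric is not symmetric: in this example the first set covers every point of the second, while the reverse coverage is zero, which is why the direction C(est, sim) matters in the comparison above.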
4.7 Summary
In this chapter, we present an analytical approach to derive the cache hit rate for a
specific cache configuration. In particular, this is achieved by the introduction of prob-
abilistic cache state, its operations and the static program analysis. We also extend our
analysis to memory hierarchies with more than one level of caches. We have conducted
an experimental study to compare the accuracy and efficiency of our analysis and trace
driven simulation. Experiments indicate that our analysis is very accurate. Overall our
analysis is shown to be efficient but with a few exceptions. We will propose a solution
to solve this problem in the next chapter.
Furthermore, our cache analysis can be easily extended to a multi-tasking embedded
system environment. Given a task, the input cache state for its cache analysis should
be the footprint of the previous task instead of the empty cache state.
Chapter 5
Design Space Exploration of Caches
In this chapter, we extend our cache analysis in chapter 4 to multiple cache configura-
tions in one pass. We achieve this by exploiting the inclusion property among related
cache configurations. We extend the probabilistic cache state to a probabilistic Generalized
Binomial Tree and enhance it with efficient operators. We compare our analysis with
Cheetah [90], a state-of-the-art cache simulator that can simulate multiple cache con-
figurations in one pass.
5.1 Introduction
In chapter 4, we have introduced the concept of probabilistic cache states, which cap-
tures the set of possible cache states at a program point along with their probabilities.
We have also proposed a static program analysis to compute the probabilistic cache
state. Obviously, our proposed static program analysis can be used for cache design
CHAPTER 5. DESIGN SPACE EXPLORATION OF CACHES 58
space exploration. However, when employed in the context of design space exploration,
the runtime of our static program analysis is not competitive compared to state-of-the-
art cache simulators such as Cheetah [90]. This is because fast cache simulators employ
single-pass simulation that estimates the hit rates for a large number of cache configu-
rations in one pass. In contrast, our technique in chapter 4 has to estimate each cache
configuration individually leading to overall slower design space exploration.
We observe that if we can extend our technique to model multiple cache configura-
tions in one pass, we get a very powerful tool for design space exploration. Thus, we
extend the concept of probabilistic cache states to achieve this goal. We borrow the data
structure, called Generalized Binomial Tree (GBT), proposed by Sugumar and Abra-
ham [90] to exploit the inclusion property among related cache configurations. GBT
enables us to capture the cache states corresponding to a number of related configura-
tions in one succinct representation. However, as a program point can be reached from
different contexts, we may have a number of GBTs, each associated with the probabil-
ity of the corresponding context. Therefore, we propose probabilistic GBT to capture
the cache states corresponding to all cache configurations and all contexts at any pro-
gram point. Cache state operators such as update and concatenation are extended for
probabilistic GBT. The underlying static program analysis remains almost the
same as that for a single configuration. Thus, we can derive the probabilistic GBT at each
point of the program as before. Now, given a probabilistic GBT, we can easily estimate
the cache hit rate of a memory access and entire program for all possible cache con-
figurations. However, maintaining these probabilistic GBTs and operating on them can
become space and time inefficient as the number of contexts increases. Therefore, we
propose a number of optimizations for space and time efficiency.
5.2 Generalized Binomial Tree (GBT)
To exploit the inclusion property among related cache configurations, we rely on the Generalized
Binomial Forest (GBF) data structure. Let us explain the GBF data structure with
an example. Again, we consider LRU as the cache replacement policy. Figure 5.1(a)
shows, for the same memory address trace, the contents of six caches with number of
sets = 1, 2, 4 and associativity = 1, 2. From the example, we observe that for the caches
with the same associativity, the memory blocks in the cache with 2(1) sets are included
in the cache with 4(2) sets. For the caches with the same number of sets, the memory
blocks in the cache with associativity 1 are included in the cache with associativity 2.
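The inclusion property can be checked with a small LRU simulation. The sketch below is only an illustration: it assumes that a memory block maps to set (block address modulo number of sets), uses the memory trace of Figure 5.1, and the helper name simulate_lru is hypothetical.

```python
from collections import OrderedDict

def simulate_lru(trace, num_sets, assoc):
    """Return the set of blocks resident after the trace in an LRU cache."""
    sets = [OrderedDict() for _ in range(num_sets)]
    for block in trace:
        lines = sets[block % num_sets]      # assumed set mapping: low-order bits
        if block in lines:
            lines.move_to_end(block)        # hit: becomes most recently used
        else:
            if len(lines) == assoc:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True
    return {b for lines in sets for b in lines}

# Memory trace from Figure 5.1.
trace = [0b0000, 0b0010, 0b0110, 0b1011, 0b1100, 0b1001]
contents = {(s, a): simulate_lru(trace, s, a) for s in (1, 2, 4) for a in (1, 2)}

# Same associativity: fewer sets is included in more sets.
sets_inclusion = all(
    contents[(1, a)] <= contents[(2, a)] <= contents[(4, a)] for a in (1, 2)
)
# Same number of sets: lower associativity is included in higher associativity.
assoc_inclusion = all(contents[(s, 1)] <= contents[(s, 2)] for s in (1, 2, 4))
```

Both checks hold for this trace, which is the property that GBF exploits to represent all six caches in one structure.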
GBF exploits the aforementioned inclusion property that holds between cache configurations.
Let us denote a set-associative cache with $2^S$ sets, line size $L$, and associativity
$N$ as $C_S^L(N)$. A GBF can represent a set of cache configurations
$\{C_S^L(n) \mid S_{min} \le S \le S_{max};\ n \le N\}$, where $2^{S_{min}}$ ($2^{S_{max}}$) is the minimum (maximum) number of sets
among the group of cache configurations and $N$ is the maximal associativity.
A GBF consists of one or more Generalized Binomial Trees (GBT). A GBT can be
defined recursively as follows. A GBT of degree 0 is a list of length N and the elements
in the list are ordered according to LRU policy (i.e., the top element is the most recently
accessed address, while the bottom element is the least recently accessed address). A






























[Figure 5.1 panels: (a) Cache content, by number of sets and associativity; (b) Generalized Binomial Forest construction. Memory trace: 0000, 0010, 0110, 1011, 1100, 1001.]
Figure 5.1: Cache content and construction of generalized binomial forest. Memory blocks are
represented by tags and set number, for example, for memory block 11(00), 00 denotes the set
and 11 is the tag.
GBT of degree k is constructed by linking two GBTs of degree k − 1 together, with the
most recently accessed N references across the root lists of the two GBTs as the new root
list. By definition, a GBT of degree k has $2^k \cdot N$ nodes.
Let us explain the construction of GBF based on the example shown in Figure 5.1.
The GBF for the cache configuration $C_2^L(2)$ consists of 4 GBTs of degree 0 (one corresponding
to each set). We use ⊥ to denote an empty cache block. The GBF for the cache
configuration $C_1^L(2)$ contains 2 GBTs of degree 1 (one corresponding to each set). The
GBT for a set s in $C_1^L(2)$ is obtained by linking two GBTs of $C_2^L(2)$ that map to the set
s. For example, the memory blocks in sets 0 and 2 of $C_2^L(2)$ map to set 0 of $C_1^L(2)$. They
are merged together with the most recently accessed 2 references as the new root. The
merging is done similarly for set 1 in $C_1^L(2)$. This process is continued until the GBF for
the cache configuration with the minimum number of sets, $C_0^L(2)$, is constructed. Now
the contents of all the cache configurations in the set $\{C_S^L(n) \mid 0 \le S \le 2;\ n \le 2\}$ can
be found in the GBF for the cache configuration $C_0^L(2)$. A detailed description of GBTs
as well as their search and update procedures can be found in [90].
Array Implementation. We use an array based implementation of GBT [90]. Let us
assume the degree of the GBT is M. The GBT is implemented as a two-dimensional array
with $2^{M+1} - 1$ rows and N columns. The rows are divided into M + 1 levels from 0 to
M, and level k has $2^k$ rows. As discussed before, a GBT of degree M has $2^M \cdot N$ nodes.
Thus, the array implementation has about a factor of two redundancy.
Figure 5.2 shows an example of the array implementation of GBT, where M = 2 and
N = 2. Given a node t in the GBT, we use des(t) to denote the number of descendants
(inclusive) of node t. The rank of a node is defined as $\log(\lceil des(t)/N \rceil)$. A memory block at a
node of rank k maps to level M − k and the row within the level is determined by the
least significant M − k bits of the memory block address. There are at most N memory
blocks in the same row and they are arranged in the order in which they have been
accessed (i.e., the leftmost memory block is the most recently used, while the rightmost
memory block is the least recently used).
Given an incoming memory block address address, the search and update proce-
dure of GBT starts from the top level and only one row in each level is checked. The













Figure 5.2: Mapping from GBT to array. The nodes in GBT are annotated with their ranks.
row examined in level k is determined by the least significant k bits of address and the
tag matches are done with the memory blocks in that row. For example, in Figure 5.2,
suppose we are searching for address 0101. We first examine 1001 and 1100 in level 0.
Then, in level 1, the address 0101 maps to row 1 and so 1011 is examined. Finally, in
level 2, the address 0101 maps to row 1 and it is found there.
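The search order described above can be sketched as follows. This is a search-only illustration (no update), assuming the array is stored as one list of rows per level; the array below contains only the blocks named in the text for M = 2 and N = 2, with every other entry left empty, so it is not a reproduction of Figure 5.2. The name search is hypothetical.

```python
def search(levels, address):
    """Search an array-based GBT from level 0 downwards.

    levels[k] holds 2**k rows; each row lists up to N blocks, most recently
    used first, with None standing for the empty element.
    Returns ((level, row, column), examined_rows), or (None, examined_rows).
    """
    examined = []
    for k, level in enumerate(levels):
        row = address & ((1 << k) - 1)      # least significant k bits
        examined.append((k, row))
        for col, block in enumerate(level[row]):
            if block == address:
                return (k, row, col), examined
    return None, examined

# Assumed contents: only the blocks mentioned in the text are filled in.
gbt = [
    [[0b1001, 0b1100]],                                            # level 0
    [[None, None], [0b1011, None]],                                # level 1
    [[None, None], [0b0101, None], [None, None], [None, None]],    # level 2
]
location, examined = search(gbt, 0b0101)
```

Only one row per level is examined, so a search touches at most M + 1 rows regardless of how many configurations the GBT represents.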
Cache Hits Computation. A two-dimensional array hit is used for storing the cache
hits for multiple cache configurations. Array hit is updated whenever a memory block access
is a cache hit, and the corresponding entries are increased by 1. However, hit[m][n]
only stores the number of references that hit in cache configuration $C_m^L(n)$ but miss in the
smaller caches $C_m^L(n')$ where $n' < n$. According to the inclusion property related to
associativity, the number of hits in $C_m^L(n)$ can be computed by summing up the hits of
itself and those from smaller caches as $\sum_{i=1}^{n} hit[m][i]$.
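The summation over associativities is a running sum over one row of the hit array. The following sketch assumes a zero-based Python list in which entry i corresponds to associativity i + 1; the counts are hypothetical.

```python
from itertools import accumulate

def hits_per_associativity(hit_row):
    """hit_row[i] counts references that hit with associativity i + 1 but miss
    with any smaller associativity; the running sum gives total hits."""
    return list(accumulate(hit_row))

# Hypothetical counts for one row hit[m] with N = 4.
hit_m = [50, 20, 5, 1]
totals = hits_per_associativity(hit_m)
```

Storing only the incremental hits means that each reference updates a single entry, and the totals for all associativities are recovered once at the end.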
5.3 Probabilistic GBT
We now describe the probabilistic cache modeling based on the Generalized Binomial Forest
(GBF). We support multiple cache configurations with a constant line size and varying
numbers of cache sets and degrees of associativity. Based on the description in Section
5.2, we are interested in the set of configurations $\{C_S^L(n) \mid S_{min} \le S \le S_{max};\ n \le N\}$,
where $2^{S_{min}}$ ($2^{S_{max}}$) is the minimum (maximum) number of cache sets and N is the
maximum associativity.
Assumptions. For the set of cache configurations above, we will have $2^{S_{min}}$ GBTs
with degree $S_{max} - S_{min}$ in the GBF. However, one memory block maps to only one
GBT based on its index in $C_{S_{min}}^L(N)$, so there is no interference among different
GBTs. We therefore assume $S_{min} = 0$. In other words, there is only one GBT of degree
$S_{max}$ in the GBF. For configurations with more than one GBT, each GBT can be
modeled independently.
More concretely, in the following, we consider a GBT of degree $M$ ($= S_{max}$) with root
list length $N$. To indicate the absence of any memory block in a cache line, we
introduce a new element ⊥. We use Ω to denote the set of all the possible GBTs of the
program. We also introduce a special empty GBT $c_\perp$.
At any program point, the GBT is determined by the program path taken before
reaching this program point. Usually a program point can be reached via multiple
program paths leading to a number of possible GBTs at that point. Thus, we introduce
the notion of probabilistic GBT.
Definition 10 (Probabilistic GBT). A probabilistic GBT $\mathcal{C}$ is a 2-tuple $\langle C, X \rangle$, where
$C \in 2^{\Omega}$ is a set of GBTs and X is a random variable. The sample space of the random
variable X is Ω. Given a GBT c, we define $\Pr[X = c]$ as the probability of c in $\mathcal{C}$.
If $c \notin C$, then $\Pr[X = c] = 0$. By definition, $\sum_{c \in \Omega} \Pr[X = c] = 1$. Finally, we
define a special probabilistic GBT $\mathcal{C}_\perp$ denoting the empty probabilistic GBT. That is,
$\mathcal{C}_\perp = \langle \{c_\perp\}, X \rangle$, where $\Pr[X = c_\perp] = 1$.
We use $\lhd$ to denote the GBT search and update operator. Given a memory block m and
a GBT c, $c \lhd m$ returns the GBT after accessing m. Meanwhile, we define operator $\unlhd$
as the search and update operator of a probabilistic GBT. Given a memory block m and a
probabilistic GBT $\mathcal{C} = \langle C, X \rangle$, $\unlhd$ updates each GBT $c \in C$, and $\mathcal{C} \unlhd m$ returns the
updated probabilistic GBT. We also extend the operator $\bigoplus$ from chapter 4 to merge multiple
probabilistic GBTs.
5.3.1 Concatenation of Probabilistic GBTs
In this subsection, we introduce the concatenation of probabilistic GBTs, which will
be used in the subsequent sections. The operator for concatenation of two GBTs, $\odot$, is
defined in Algorithm 1. In the array based implementation of GBT, c2 is a multilevel
two-dimensional array. The concatenation is done by using the memory blocks in c2
from the bottom level to top level and from right to left to update c1. In other words,
the update is done from the least recently used to most recently used memory blocks
of c2. An example of GBT concatenation is shown in Figure 5.3. Let us assume the
GBT after the first and second memory traces are c1 and c2, respectively. Then the GBT
Algorithm 1: Implementation of the $\odot$ operation
input : GBTs c1 and c2
output: c = c1 $\odot$ c2
c = c1;
for lev ← M to 0 do
    Let T be the two-dimensional array at level lev in c2;
    foreach row ∈ T do
        for col ← N to 1 do
            if T[row][col] ≠ ⊥ then
                c = c $\lhd$ T[row][col];
return c;
after accesses corresponding to the two memory traces sequentially is $c_1 \odot c_2$. Next, we
extend the concatenation operation to probabilistic GBTs.
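Definition 11 (Concatenation of Probabilistic GBTs). Given probabilistic GBTs $\mathcal{C}_1 = \langle C_1, X_1 \rangle$
and $\mathcal{C}_2 = \langle C_2, X_2 \rangle$,
$$\mathcal{C}_1 \bigodot \mathcal{C}_2 = \mathcal{C} \quad \text{where } \mathcal{C} = \langle C, X \rangle$$
$$C = \{c \mid c = c_1 \odot c_2,\ c_1 \in C_1,\ c_2 \in C_2\}$$
$$\Pr[X = c] = \sum_{c_1 \in C_1,\ c_2 \in C_2,\ c = c_1 \odot c_2} \left(\Pr[X_1 = c_1] \times \Pr[X_2 = c_2]\right)$$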














Figure 5.3: Concatenation for GBTs where M = 1 and N = 2.
Let us assume the execution of two program fragments sequentially, each starting
with an empty GBT. The probabilistic GBTs after the execution of the first and second
program fragments are $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. Then the probabilistic GBT after
execution of the two program fragments sequentially is $\mathcal{C}_1 \bigodot \mathcal{C}_2$.
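The concatenation of probabilistic GBTs can be illustrated on a deliberately simplified model. The sketch below assumes a GBT of degree 0, that is, a single LRU list of length N stored as a tuple with the most recently used block first and None for the empty element; a probabilistic GBT is then a dictionary from such tuples to probabilities. The names access, concat and concat_prob, and the sample values, are hypothetical.

```python
from collections import defaultdict

N = 2  # assumed root list length

def access(c, m):
    """Search and update for one LRU list: m moves to the front."""
    rest = [b for b in c if b is not None and b != m]
    new = ([m] + rest)[:N]                      # drop the LRU block on overflow
    return tuple(new + [None] * (N - len(new)))

def concat(c1, c2):
    """Replay the blocks of c2 on c1 from least to most recently used."""
    c = c1
    for block in reversed(c2):
        if block is not None:
            c = access(c, block)
    return c

def concat_prob(p1, p2):
    """Every pair (c1, c2) contributes Pr[c1] * Pr[c2] to the state concat(c1, c2)."""
    out = defaultdict(float)
    for c1, pr1 in p1.items():
        for c2, pr2 in p2.items():
            out[concat(c1, c2)] += pr1 * pr2
    return dict(out)

p1 = {(1, None): 0.5, (2, None): 0.5}
p2 = {(3, 4): 0.5, (None, None): 0.5}
result = concat_prob(p1, p2)
```

In this example the state (3, 4) is reached from both states of p1, so its two products are summed, while concatenation with the empty state leaves the states of p1 unchanged.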
5.3.2 Combining GBTs in a Probabilistic GBT
A program path can be specified by the basic block sequence. Although multiple paths
could reach a program point, they probably traverse some common basic block sub-
sequence. Thus, the set of GBTs in a probabilistic GBT can include some identical
memory blocks. By combining the similar GBTs together, we can reduce the space
requirement of probabilistic GBTs. More importantly, the search and update of proba-
bilistic GBTs will be much faster.
In the array based implementation, GBT is divided into M + 1 levels. We combine
the GBTs level by level from top to bottom. More concretely, given two GBTs, if the
[Figure 5.4, panel (a): Combination of probabilistic GBT $\langle \{c_1, c_2\}, X \rangle$ with Pr[X = c1] = 0.5 and Pr[X = c2] = 0.5.]
Figure 5.4: Probabilistic GBT combination and concatenation.
content of the top k (k ≤ M + 1) levels are identical, then they are combined together
to have only one copy of the top k levels as shown in Figure 5.4 (a). Also as the GBTs
are combined together, the probabilities are now associated with each level rather than
with the GBTs.
It is possible to perform combination at finer granularity, for example, using rows
rather than levels. However, the complexity of the combination process increases con-
siderably leading to slower implementation. It is also possible that two GBTs are differ-
ent at the top levels, but they are identical at the bottom levels. We choose not to perform
combination for such GBTs. This is because, as the probabilistic GBT is updated, the
contents from the upper levels move to the lower levels. Thus the commonality among
the GBTs are lost and they have to be split again. It is far more efficient to combine
GBTs only if they are identical at the top levels.
The implementation of a combined GBT can be viewed as a tree with the sub-arrays
(levels) of the original GBTs as nodes (see Figure 5.4(a)). The sub-array corresponding
to the common top levels 0 − k is the root node of this tree. Level k, however, has
multiple children at level k + 1. Now the search and update of probabilistic GBTs
become more efficient. Consider a memory block m that is present somewhere in the
top k levels. Without combination, m will be searched in all the original GBTs; now
it will be searched only once in the combined GBT. For example, in Figure 5.4(a),
before combination, the reference to memory block 100 is searched in both c1 and c2.
With combined GBT, it is only searched once. In Figure 5.4(b), we show the combined
probabilistic GBT after concatenation operation.
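The combination of GBTs that share identical top levels behaves like a prefix tree over levels. The sketch below is an illustration under assumptions: each GBT is given as a tuple of hashable level contents from level 0 downwards, the nested-dictionary layout is hypothetical, and the example loosely follows Figure 5.4(a) with M = 1 and N = 2 (the exact row placement of blocks 110 and 101 is assumed).

```python
def combine(gbts):
    """Merge (levels, probability) pairs into a tree keyed by level content.

    GBTs that agree on their top k levels share the first k nodes, and each
    node accumulates the probability of all GBTs passing through it.
    """
    root = {}
    for levels, prob in gbts:
        children = root
        for content in levels:
            node = children.setdefault(content, {"prob": 0.0, "children": {}})
            node["prob"] += prob
            children = node["children"]
    return root

top = (("100", "011"),)                          # level 0: one row of N = 2 blocks
c1 = (top, (("110", None), (None, None)))        # level 1: two rows
c2 = (top, ((None, None), ("101", None)))
combined = combine([(c1, 0.5), (c2, 0.5)])
shared = combined[top]
child_probs = sorted(n["prob"] for n in shared["children"].values())
```

The shared level-0 node carries probability 1 and each level-1 child carries 0.5, so a reference to block 100 is searched once instead of once per GBT.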
5.3.3 Bounding the size of Probabilistic GBT
We observe that, in a probabilistic GBT, some of the constituent GBTs have very low
probabilities. That is, these GBTs correspond to rare program paths. Based on this
observation, we prune some of the GBTs for space and time efficiency.
We define the metric dist for pruning. Consider a combined GBT with two nodes
at level k. Each node is a two-dimensional array with $2^k$ rows and N columns. Given two
such nodes n1, n2 at the same level, we define d(n1, n2) as the measure of the distance
between them. It is defined as a function of the number of different memory blocks
between them. But higher priority is given to the more recently used memory blocks as
shown in Equation 5.1.
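[Figure 5.5: pruning in a combined GBT spanning levels k, k + 1 and k + 2. Node m at level k + 1 has Pr[m] < Te and is pruned; Pr[m] is added to its sibling m1.]
$$d(n_1, n_2) = \sum_{i} \sum_{j=1}^{N} \begin{cases} N - j + 1, & \text{if } n_1[i][j] \ne n_2[i][j] \\ 0, & \text{otherwise} \end{cases} \qquad (5.1)$$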
We apply two merging strategies. First, if the probability of a node n is too small
(< Te), then the subtree rooted at n is pruned. But its probability is added to the
subtree rooted at the closest sibling of n (the closest is defined by the dist metric).
Second, if the number of children of a node exceeds a pre-defined limit Z, then Z
children with highest probability are kept and the subtrees rooted at the rest of the
children are pruned. As before, the probability of each pruned child is added to its
closest surviving sibling defined by the dist metric. The pruning process continues
from top to bottom. As shown in Figure 5.5, the subtree rooted at m (including m)
is pruned because its probability is too small. However, its probability is added to the
subtree rooted at m1, which is the closest sibling of m. Similar pruning strategy can be
applied across independent or merged GBTs in a probabilistic GBT. In practice, we set
$T_e$ to $10^{-6}$ and Z to 4.
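The two merging strategies can be sketched for the children of a single node. This is a simplified illustration, not our implementation: it assumes the distance is the position-weighted count of differing entries (column 0 being the most recently used), applies the threshold and the limit Z in one step, and always redirects the probability of a pruned child to its closest surviving sibling. The names dist and prune and the sample nodes are hypothetical.

```python
def dist(n1, n2):
    """Weighted number of differing entries between two same-level nodes.
    With zero-based column j, a mismatch costs N - j (MRU entries weigh most)."""
    n_cols = len(n1[0])
    return sum(
        n_cols - j
        for r1, r2 in zip(n1, n2)
        for j, (a, b) in enumerate(zip(r1, r2))
        if a != b
    )

def prune(children, te=1e-6, z=4):
    """children: list of (node, probability). Keep at most z children with
    probability >= te; add each pruned probability to the closest survivor."""
    order = sorted(range(len(children)), key=lambda i: children[i][1], reverse=True)
    survivors = [i for i in order if children[i][1] >= te][:z] or order[:1]
    probs = {i: children[i][1] for i in survivors}
    for i in order:
        if i in probs:
            continue
        closest = min(survivors, key=lambda s: dist(children[s][0], children[i][0]))
        probs[closest] += children[i][1]
    return [(children[i][0], probs[i]) for i in survivors]

# Hypothetical one-row nodes with N = 2; a large threshold is used for visibility.
a, b, c = [[1, 2]], [[1, 3]], [[4, 5]]
pruned = prune([(a, 0.5), (b, 0.125), (c, 0.375)], te=0.2)
```

Here b falls below the threshold and is closer to a (distance 1) than to c (distance 3), so its probability moves to a and the total probability mass is preserved.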
5.3.4 Cache Hit Rate of a Memory Block
Recall that in section 5.2, if a memory block m results in a cache hit, the correspond-
ing entries in the array hit are incremented by 1. However, in our probabilistic cache
modeling, we get a cache hit probability by looking up the probabilistic GBT. The hit
probability is simply the sum of the probabilities of all the nodes where m can be found
in the probabilistic GBT. Now we add this hit probability to the hit array.
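The lookup of a hit probability can be sketched on the same simplified model of a degree-0 GBT, assuming a probabilistic GBT is a dictionary from LRU tuples (most recently used first) to probabilities. The function name and the sample state are hypothetical.

```python
def hit_probabilities(prob_gbt, m, n_ways):
    """Entry i is the probability that m hits with associativity i + 1 but
    misses with any smaller associativity."""
    hit = [0.0] * n_ways
    for state, pr in prob_gbt.items():
        if m in state:
            hit[state.index(m)] += pr
    return hit

# Hypothetical probabilistic state with N = 2.
prob_gbt = {(3, 4): 0.5, (1, None): 0.25, (4, None): 0.25}
hit = hit_probabilities(prob_gbt, 4, 2)
hit_assoc_2 = sum(hit)   # total hit probability with associativity 2
```

Block 4 is the most recently used block with probability 0.25 and the second most recently used with probability 0.5, so it hits with probability 0.25 for associativity 1 and 0.75 for associativity 2.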
For memory block m, we can get its hit rate Hm for different cache configurations
if the probabilistic GBT at that program point is known. Then the cache hit rate of the
whole program can be derived from Equation 4.1. Now we present our static analysis
method to derive the probabilistic GBTs at every program point.
5.4 Static Cache Analysis
The static cache analysis remains almost the same as the one presented in chapter 4.
All the probabilistic cache state operators such as update, merging, concatenation have
been extended for probabilistic GBT. The analysis for DAG and loop iterations, and the
whole program are not changed. The only exception is that we relax the convergence
constraint in applying concatenation for loop iterations. For iterations n and n + 1,
if the difference of probabilities between every pair of identical GBTs in $\mathcal{C}^{out}_{end}\langle n \rangle$ and
$\mathcal{C}^{out}_{end}\langle n+1 \rangle$ is within $T_e$, we declare convergence. Experimental results confirm that
convergence is reached quickly for most of the loops in all the benchmark programs.
In the worst case, the concatenation operation is terminated at a pre-defined threshold of
MaxN iterations. The average probabilistic GBT across these MaxN iterations is used
as an approximation of the average probabilistic GBT across NL iterations. In practice,
we set MaxN to 100 and $T_e$ to $10^{-6}$.
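The relaxed convergence test can be sketched as a bounded fixed-point iteration. The sketch assumes that a probabilistic GBT is a dictionary from states to probabilities, that a state missing from one iteration has probability 0, and that step stands for the analysis of one loop iteration; all three are assumptions made for illustration. The averaging over MaxN iterations in the non-converging case is omitted.

```python
def converged(prev, curr, te=1e-6):
    """True if every state's probability changed by at most te."""
    keys = set(prev) | set(curr)
    return all(abs(prev.get(k, 0.0) - curr.get(k, 0.0)) <= te for k in keys)

def iterate(step, start, max_n=100, te=1e-6):
    """Apply step until two successive results are within te, or max_n is hit.
    Returns the last state and the number of applications of step."""
    state = start
    for n in range(1, max_n + 1):
        nxt = step(state)
        if converged(state, nxt, te):
            return nxt, n
        state = nxt
    return state, max_n

# Toy step (hypothetical): half of the probability of state "A" moves to "B".
def toy_step(p):
    return {"A": p["A"] / 2, "B": p["B"] + p["A"] / 2}

final, iterations = iterate(toy_step, {"A": 1.0, "B": 0.0})
```

For this toy step the change between successive iterations halves each time and first drops below the threshold at the 20th iteration, well before the MaxN limit.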
5.5 Experimental Evaluation
We evaluate the accuracy and efficiency of our analytical approach by comparing it with
cache simulator Cheetah [90]. Cheetah is the fastest known cache simulator, which can
simulate multiple cache configurations in a single pass. The experimental settings are the
same as in chapter 4. The cache hit rates of both simulation and estimation are shown
in Figure 5.6. The horizontal axis represents the 20 cache configurations, where a × b
represents a cache with a sets and associativity b. The estimation results are
very close to simulation results. For all the benchmarks and cache configurations, we
achieve high accuracy (0.7% average error). The error is defined as |est − sim| where
est(sim) is the estimated (simulated) cache hit rate.
The estimation and simulation time are shown in Table 5.1. Column Multiple Passes
Analysis shows the estimation time of the cache modeling of chapter 4. Column Sin-
gle Pass Analysis represents the time spent in current cache modeling, which estimates
multiple cache configurations in a single pass. Our analysis is significantly faster
(24–3,855× speedup) compared to Cheetah simulation. In addition, compared to the
multiple-pass analysis presented in chapter 4, our current single-pass analytical model
is much more efficient.
Figure 5.6: Estimation vs simulation across 20 configurations.
Figure 5.7: Estimation vs simulation across 20 configurations. Estimation is based on the
profiles of an input different from simulation input.
Benchmark   Cheetah (sec)   Single Pass Analysis (sec)   Multi Passes Analysis (sec)   Ratio
bitcount    138.8           0.036                        0.132                         3855.56
dijkstra    143.87          0.298                        1.707                         482.79
adpcmdec    33.055          0.086                        1360.413                      384.36
adpcmenc    41.321          0.197                        3623.264                      209.75
sha         21.524          0.063                        0.159                         341.65
rijndael    32.827          0.065                        0.157                         505.03
susans      118.9           0.268                        1.105                         443.66
susanc      28.234          0.577                        inf                           48.93
gsmenc      67.65           1.777                        2.638                         38.07
gsmdec      35.26           1.462                        4.956                         24.12
Table 5.1: Runtime comparison of Cheetah simulator and our analysis. Simulation time is
shown in Column Cheetah. Ratio is defined as Cheetah / Single Pass Analysis.
Analysis Sensitivity. In the above experiments, the estimation is based on the profiles
(basic block and control flow edges execution counts) of one input. The simulation
results are collected using the same input as estimation. Here, we evaluate the sensitivity
of the profiles across different inputs. More precisely, we use the profile from one input
to estimate the cache hit rate. Then, we compare it with the simulation result of
another input. The comparison result is presented in Figure 5.7. As shown, the estimation
predicts the simulation very well even if the profiles are from different inputs.
5.6 Summary
In this chapter, we propose an analytical approach for rapid and accurate design space
exploration of instruction caches. In the end, our analysis method can estimate the hit
rates for all cache configurations with varying number of sets and associativity in one
pass as long as the cache line size remains constant. We achieve this by exploiting
the inclusion property among related cache configurations. The input to our analysis
is simply the basic block and control flow edge execution count profiles, which are
significantly more compact than the memory address traces required by trace-driven
simulators and other trace based analytical works. Our experimental evaluation for a
number of embedded benchmarks reveals that our estimation is highly accurate and our
single-pass cache analysis is efficient compared to the fastest known single-pass cache
simulator Cheetah.
Now, given a cache design space, our analysis can explore this space efficiently
and return the cache hit/miss for every cache configuration. Then, given certain design
objectives, such as performance, energy consumption, or a combination of the two,
the cache hit/miss returned by our analysis will be used by various performance and
energy models to select the best cache configurations accordingly. Compared to
the baseline cache configuration (direct mapped or fully associative caches), the best
cache configurations will improve the cache hit rate. As a result, both the performance
and energy consumption are improved as cache misses consume more power and incur
longer delay than cache hits.
Chapter 6
Instruction Cache Locking
In this chapter, we consider cache locking to further optimize the performance of in-
struction cache. We will show that cache locking can be quite effective in improving
the average-case execution time of general embedded applications. We first introduce
temporal reuse profile to accurately and efficiently model the cost and benefit of locking
memory blocks in the cache. Then, we propose an optimal algorithm and a heuristic
approach that use the temporal reuse profile to determine the most beneficial memory
blocks to be locked in the cache.
6.1 Introduction
Most modern embedded processors (e.g., ARM Cortex series processors) feature
cache locking mechanisms whereby one or more cache blocks can be locked under
software control using special lock instructions. Once a memory block is locked in
CHAPTER 6. INSTRUCTION CACHE LOCKING 77
the cache, it cannot be evicted from the cache by the replacement policy. Thus, all
the subsequent accesses to the locked memory blocks will be cache hits. Only when
the cache line is unlocked can the corresponding memory block be replaced. Cache
locking was initially designed to improve the timing predictability of hard real-time
embedded systems. As the cache content is known statically, the memory access time
of each reference can be determined accurately leading to tighter worst-case execution
time (WCET) analysis. Hence, most cache locking algorithms proposed in the literature
aim to improve the WCET of the application.
However, cache locking has the potential to significantly improve the average-case
performance of a general embedded application. This can be achieved by systematically
eliminating conflict misses in the cache through locking. For example, consider two
memory blocks m0 and m1 that are mapped to the same cache set and the sequence of
memory block accesses is $(m_0 m_1)^{10}$. Given a direct mapped cache, all the accesses will
be cache misses (20 misses) as m0 and m1 replace each other in the cache alternately.
However, if either m0 or m1 is locked in the cache, then the total number of cache misses
can be reduced to 10. Note that locking a block in a cache set can negatively impact the
performance of the remaining memory blocks mapped to the same set as the effective
cache capacity gets reduced. Therefore, any cache locking algorithm should carefully
balance the cost and benefit of locking.
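The example above can be reproduced with a small simulation of a single cache set. The sketch assumes that locked blocks are preloaded (so every access to them is a hit) and that the remaining ways follow LRU; the function name count_misses is hypothetical.

```python
def count_misses(trace, assoc, locked=frozenset()):
    """Count misses in one cache set with `assoc` ways under LRU, where the
    blocks in `locked` each permanently occupy one way."""
    free_ways = assoc - len(locked)
    lru = []                      # unlocked blocks, most recently used first
    misses = 0
    for m in trace:
        if m in locked:
            continue              # locked blocks always hit
        if m in lru:
            lru.remove(m)         # hit: refreshed below
        else:
            misses += 1
            if free_ways == 0:
                continue          # no unlocked way left to hold the block
            if len(lru) == free_ways:
                lru.pop()         # evict the least recently used block
        lru.insert(0, m)
    return misses

trace = ["m0", "m1"] * 10         # the access sequence (m0 m1)^10
without_locking = count_misses(trace, assoc=1)
with_locking = count_misses(trace, assoc=1, locked=frozenset({"m0"}))
```

Without locking, the two blocks evict each other on every access; with m0 locked, only the accesses to m1 miss.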
In this chapter, we explore instruction cache locking to improve the average-case
execution time. Recently, Anand and Barua [12] have presented an instruction cache
locking heuristic with the same objective. Their experiments confirm that locking is
beneficial in improving average case performance. However, there are two major draw-
backs in their work. First, they propose an iterative approach where detailed cache
simulation is employed in every iteration to evaluate the cost/benefit of locking the
memory blocks. Thus the algorithm is quite inefficient, especially for large benchmarks.
Moreover, they employ some approximations in the cost/benefit analysis to cut down
simulation cost leading to inaccuracy.
We introduce temporal reuse profile (TRP) to accurately and compactly capture
the cost/benefit of locking each memory block. TRP is significantly more compact
compared to memory traces. We propose two cache locking algorithms based on TRP:
an optimal algorithm based on branch-and-bound search and a heuristic approach. We
show that our cache locking heuristic improves the state of the art [12] in terms of both
performance and efficiency and achieves close to the optimal result. We also compare
cache locking with a complementary technique called procedure placement, which will
be detailed in chapter 7. The procedure placement techniques improve instruction cache
performance through procedure reordering such that the conflict misses in the cache can
be reduced. We show that procedure placement followed by cache locking can be an
effective strategy in enhancing the instruction cache performance significantly.
6.2 Cache Locking Problem
In this section, we formally define the cache locking problem. We only consider static
instruction cache locking in this thesis where the instructions are locked in the cache at
the beginning of program execution and remain locked throughout the program execution.
Note that the mapping of instructions to the cache sets depends on the code memory
layout. Inserting additional code for cache locking may disturb this layout. To avoid
this problem, we use the trampolines [24] approach. The extra code to fetch and lock
the memory blocks in the cache is inserted at the end of the program as a trampoline. We
leave some dummy NOP instructions at the entry point of the program that get replaced
by a call to this trampoline after the locking decisions are made. As we are considering static
cache locking, the cost of executing the trampoline is negligible and we will ignore this
overhead in the rest of the discussion.
Cache Terminology. Recall that we use L to represent block or line size, K to repre-
sent number of sets, and A to represent associativity. Now the cache size is defined as
(K×A×L). In this work, we consider Least Recently Used (LRU) replacement policy
where the block replaced is the one that has been unused for the longest time.
Two locking mechanisms are commonly used in modern embedded processors —
way locking and line locking. In way locking, particular ways of a set associative cache
are selected for locking and these ways are locked for all the cache sets. Way-locking
is employed by ARM processor series [4, 6]. Compared to way locking, line locking
is a finer grained locking mechanism. In line locking, different number of lines can be
locked for different cache sets. Line locking is employed by Intel’s XScale [1], ARM9
family and Blackfin 5xx family processors [2]. We assume the presence of line locking
mechanism in this work.
Cache misses can be broadly categorized into cold (compulsory) misses, capacity
misses, and conflict misses. Cold misses are caused by the first reference to a memory
block. Cache locking eliminates the cold miss, but at the same time introduces
additional overhead to fetch and lock the memory block at the beginning of program
execution (through the trampoline). We ignore the cost of execution of the trampoline,
which is negligible. Capacity misses are incurred due to the limited cache size and can-
not be mitigated through locking. Indeed, locking a memory block in the cache reduces
the cache capacity available to the remaining memory blocks and may negatively impact
their hit rates. So cache locking primarily aims to eliminate conflict misses while
minimizing the negative impact on the unlocked memory blocks.
Let T be the memory trace (sequence of memory block references) generated by
executing a program on the target architecture. We use Mi to denote the set of all the
memory blocks that are mapped to the ith cache set Ci. A memory block m is mapped
only to set (m modulo K). Thus, for any two distinct cache sets Ci and Cj, we have
Mi ∩ Mj = ∅. That is, there is no interference between the cache sets and they can
be modeled independently to arrive at locking decisions. Therefore, the trace T can be
partitioned into K traces T1, . . . , TK, one corresponding to each cache set. The trace
Ti corresponding to cache set Ci only contains the memory blocks Mi from the original
trace T. Finally, given a memory block m ∈ Mi, let us define the jth reference of m in
the trace Ti as m[j].
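The partitioning of T into per-set traces can be sketched as follows. This is a minimal Python illustration rather than the tooling used in this work; the function name partition_trace and the sample addresses are assumptions made only for the example.

```python
from collections import defaultdict


def partition_trace(addresses, line_size, num_sets):
    """Split an instruction address trace into one block trace per cache set."""
    per_set = defaultdict(list)
    for addr in addresses:
        block = addr // line_size                  # memory block number m
        per_set[block % num_sets].append(block)    # m is mapped to set (m modulo K)
    return dict(per_set)


# Hypothetical addresses; L = 32 bytes and K = 4 sets.
trace = [0x000, 0x004, 0x080, 0x020, 0x100, 0x0A0]
sets = partition_trace(trace, line_size=32, num_sets=4)
```

Each list in sets plays the role of one trace Ti, so the later analysis can be run on every cache set in isolation.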
A memory block m benefits from cache locking as all its references will be cache
hits. It is straightforward to quantify this benefit of cache locking. Let access(m) be
the total number of accesses to memory block m. Then by locking m, we will get
access(m) cache hits. That is,
num_hit(m) = access(m) if m is locked
However, locking memory block m ∈ Mi in cache set Ci will have a negative impact
on the other memory blocks Mi\{m}. We will now proceed to characterize this negative
impact accurately.
Theorem 1. Given two memory blocks m,m′ ∈ Mi, if m[j] is a cache miss before
locking m′, then m[j] will remain a cache miss after locking m′ in cache set Ci.
Proof. The proof follows directly from the inclusion property of the LRU replacement
policy. The inclusion property states that after any series of references, a smaller store
always contains a subset of the blocks in the larger store. After locking m′, the number
of available cache blocks in set Ci reduces by one. Clearly, if m[j] (the jth reference of m
in the trace) was originally a miss (i.e., not present in the cache set) with more cache
blocks, it will still be a miss with one fewer cache block.
Definition 12 (Temporal Conflict Set (TCS)). Given a memory reference m[j] (j > 1)
in the trace where m ∈ Mi, its temporal conflict set TCSm[j] is defined as the set of unique
memory blocks referenced between m[j−1] and m[j] in Ti. If there is no such reference,
then TCSm[j] = ∅.
For example, in Figure 6.1, the temporal conflict set of memory block m2 is {m1}
for its second reference and {m0} for its third reference. The temporal conflict set
determines whether the memory block reference will be a cache hit or a cache miss.
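As an illustration of Definition 12, the sketch below computes the temporal conflict set of every non-first reference in a per-set trace. The trace T is hypothetical: it was constructed to reproduce the conflict sets described for Figure 6.1 and is not claimed to be the trace drawn in the figure. The function name is likewise an assumption.

```python
def temporal_conflict_sets(set_trace):
    """Return a list of (block, j, TCS) for every reference m[j] with j > 1."""
    last_pos = {}    # block -> index of its previous reference
    count = {}       # block -> number of references seen so far
    result = []
    for pos, m in enumerate(set_trace):
        count[m] = count.get(m, 0) + 1
        if m in last_pos:
            # Unique blocks referenced strictly between m[j-1] and m[j].
            tcs = frozenset(set_trace[last_pos[m] + 1:pos])
            result.append((m, count[m], tcs))
        last_pos[m] = pos
    return result


# Hypothetical trace of three blocks mapped to the same cache set.
T = ["m2", "m1", "m2", "m0", "m0", "m2", "m1", "m0", "m1"]
tcs_list = temporal_conflict_sets(T)
```

For this trace, the second and third references of m2 have conflict sets {m1} and {m0}, while the second reference of m1 sees both m0 and m2.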
Theorem 2. If |TCSm[j]| ≥ A for memory block m ∈ Mi, then the reference m[j] will
be a cache miss.
Proof. The proof follows directly from the definition of the LRU replacement policy. As
A or more unique memory blocks are brought into the cache set between m[j−1] and m[j],
memory block m will be replaced from the cache and will incur a miss at its next reference.
Moreover, following Theorem 1, if |TCSm[j]| ≥ A, then m[j] will be a cache miss
irrespective of locking other memory blocks in the cache. Therefore, we can eliminate
TCSm[j] from further consideration as far as cache locking decisions are concerned.
For example, in Figure 6.1, the second reference to memory block m1 is a cache miss and
its temporal conflict set can be removed.
Let Locki be the set of memory blocks locked in cache set Ci. Clearly, |Locki| ≤ A.
Theorem 3. If |TCSm[j]| < A for m ∈ Mi\Locki, then m[j] will be a cache miss only
when |Locki ∪ TCSm[j]| ≥ A.
Proof. As |TCSm[j]| < A, the reference m[j] will be a cache hit in the original cache.
Now as we lock memory blocks into the cache set Ci, the space available to accommodate
the unlocked memory blocks will reduce. m[j] will be a cache miss when the number
of distinct conflicting and locked blocks together reaches the associativity of the
cache. That is, m[j] will be a cache miss when |Locki ∪ TCSm[j]| ≥ A.
For example, in Figure 6.1, the second reference of memory block m2 will be a cache
miss if m0 is locked, because |{m0,m1}| ≥ 2. However, it will remain a cache hit if
m1 is locked.
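The condition of Theorem 3 can be cross-checked against a direct simulation of one cache set. The sketch below is only illustrative: simulate_lru and predicted_miss are hypothetical names, locked blocks are assumed to be preloaded (the trampoline cost is ignored, as in the text), and T is a hypothetical trace constructed to match the conflict sets described for Figure 6.1.

```python
def simulate_lru(set_trace, assoc, lock=frozenset()):
    """Return a hit/miss flag per reference for one cache set.

    Blocks in `lock` are assumed preloaded and pinned; the remaining
    assoc - len(lock) ways are managed with LRU replacement.
    """
    free_ways = assoc - len(lock)
    lru = []                          # unlocked resident blocks, most recent last
    outcome = []
    for m in set_trace:
        if m in lock:
            outcome.append(True)      # locked blocks always hit
        elif m in lru:
            lru.remove(m)
            lru.append(m)             # refresh recency
            outcome.append(True)
        else:
            if free_ways > 0:
                if len(lru) == free_ways:
                    lru.pop(0)        # evict the least recently used block
                lru.append(m)
            outcome.append(False)
    return outcome


def predicted_miss(tcs, lock, assoc):
    """Theorem 3 condition for a non-first reference of an unlocked block."""
    return len(set(tcs) | set(lock)) >= assoc


T = ["m2", "m1", "m2", "m0", "m0", "m2", "m1", "m0", "m1"]
# T[2] is the second reference of m2; its temporal conflict set is {m1}.
miss_if_m0_locked = not simulate_lru(T, 2, frozenset({"m0"}))[2]
miss_if_m1_locked = not simulate_lru(T, 2, frozenset({"m1"}))[2]
```

In this example the simulation agrees with the prediction: locking m0 turns the second reference of m2 into a miss, whereas locking m1 leaves it a hit.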
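[Memory access trace with cache hits and misses marked; the trace itself is not recoverable from the source.]
Memory block   Accesses   Temporal reuse profile
m0             3          {⟨∅, f(∅) = 1⟩}
m1             3          {⟨{m0}, f({m0}) = 1⟩}
m2             3          {⟨{m0}, f({m0}) = 1⟩, ⟨{m1}, f({m1}) = 1⟩}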
Figure 6.1: Temporal reuse profiles from a sequence of memory access for a 2-way set associa-
tive cache. Memory blocks m0,m1 and m2 are mapped to the same set. Cache hits and misses
are highlighted.
Let Rm = {TCSm[j] : j > 1, |TCSm[j]| < A}, i.e., Rm is the set of temporal conflict sets of
the references of m that result in hits in the original cache.
Definition 13 (Temporal Reuse Profile). The temporal reuse profile TRPm of a mem-
ory block m is defined as a set of 2-tuples {〈s, f(s)〉} where s ∈ Rm and f(s) denotes
the frequency of the temporal conflict set s in the trace.
Figure 6.1 shows an example of temporal reuse profiles given a trace of memory
block accesses. There are three memory blocks in the trace and the number of accesses
to each of them is collected. More importantly, for each memory block, only the temporal
conflict sets that result in cache hits are kept in the profile. The TCS of the second
reference of memory block m1 is not kept, because it is a cache miss in a 2-way set
associative cache (both m0 and m2 are referenced in between).
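A temporal reuse profile can be collected in one pass over a per-set trace, as sketched below. The function name build_trp and the dictionary layout are assumptions for illustration; T is a hypothetical trace constructed to yield the profiles listed in Figure 6.1 for a 2-way cache.

```python
from collections import Counter, defaultdict


def build_trp(set_trace, assoc):
    """Return (access counts, TRP) for one per-set trace.

    The TRP maps each block to {temporal conflict set: frequency} and keeps
    only conflict sets smaller than the associativity (hits without locking).
    """
    access = Counter()
    trp = defaultdict(Counter)
    last_pos = {}
    for pos, m in enumerate(set_trace):
        access[m] += 1
        if m in last_pos:
            tcs = frozenset(set_trace[last_pos[m] + 1:pos])
            if len(tcs) < assoc:
                trp[m][tcs] += 1
        last_pos[m] = pos
    return dict(access), {m: dict(c) for m, c in trp.items()}


T = ["m2", "m1", "m2", "m0", "m0", "m2", "m1", "m0", "m1"]
access, trp = build_trp(T, assoc=2)
```

The slice used to collect the conflict set keeps the sketch short; a production profiler would more likely maintain a recency stack to avoid rescanning the trace.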
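Given the temporal reuse profiles for a program execution and the locked memory
blocks per cache set Locki : i = 1 . . . K, we can now accurately compute the number
of cache hits for each memory block m ∈ Mi
$$
\mathit{num\_hit}(m) =
\begin{cases}
\sum_{\langle s, f(s) \rangle \in TRP_m,\ |s \cup Lock_i| < A} f(s) & \text{if } m \notin Lock_i \\
access(m) & \text{otherwise}
\end{cases}
$$
the number of cache hits for each cache set Ci
$$
\mathit{num\_hit}(T_i) = \sum_{m \in M_i} \mathit{num\_hit}(m)
$$
and the total number of cache hits for the entire program
$$
\mathit{num\_hit}(T) = \sum_{i=1}^{K} \mathit{num\_hit}(T_i)
$$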
Problem Statement. The goal of static instruction cache locking is to determine
the set of memory blocks to be locked per cache set Locki : i = 1 . . . K such that
num hit(T ) is maximized.
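The hit count for a given lock list can be evaluated directly from the profiles, without touching the trace again. The sketch below is an illustration under assumed names (num_hit_block, num_hit_set); the access counts and profiles are those listed in Figure 6.1 for a 2-way cache.

```python
def num_hit_block(m, access, trp, lock, assoc):
    """Hits of block m in its cache set under the lock list `lock`."""
    if m in lock:
        return access[m]                       # every access to a locked block hits
    # An unlocked reference still hits only if its conflict set plus the
    # locked blocks occupy fewer than `assoc` ways.
    return sum(f for s, f in trp.get(m, {}).items() if len(s | lock) < assoc)


def num_hit_set(access, trp, lock, assoc):
    """Total hits of one cache set under the lock list `lock`."""
    return sum(num_hit_block(m, access, trp, lock, assoc) for m in access)


access = {"m0": 3, "m1": 3, "m2": 3}
trp = {
    "m0": {frozenset(): 1},
    "m1": {frozenset({"m0"}): 1},
    "m2": {frozenset({"m0"}): 1, frozenset({"m1"}): 1},
}
hits_unlocked = num_hit_set(access, trp, frozenset(), assoc=2)
hits_lock_m0 = num_hit_set(access, trp, frozenset({"m0"}), assoc=2)
hits_lock_m0_m1 = num_hit_set(access, trp, frozenset({"m0", "m1"}), assoc=2)
```

Summing num_hit_set over all cache sets gives the program-wide hit count that the problem statement seeks to maximize.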
6.3 Cache Locking Algorithm
Our cache locking algorithm consists of two phases: profiling and locking.
Profiling Phase. The profiling phase creates the temporal reuse profile (TRP) for each
memory block in the program. This profiling can be achieved either by simulating the
application or by executing the application on the target platform with a representative
set of inputs. The simulation or execution creates the instruction address trace. The
temporal reuse profile is built by a single pass through the instruction address trace.
Locking Phase. The locking phase determines the set of memory blocks to be locked
in each cache set such that the number of cache hits is maximized. We propose two
algorithms to select the memory blocks to be locked. One is an optimal solution based
on branch and bound search and the other one is an iterative heuristic.
6.3.1 Optimal Algorithm
Our optimal cache locking algorithm is presented in Algorithm 2. Cache sets are inde-
pendent. Thus, each cache set can be analyzed individually. For each memory block m,
we have to decide whether to lock it or not. Thus, the entire search space can be seen as
a binary decision tree. Algorithm 2 covers the entire search space and is guaranteed to
find the optimal solution. However, in the worst case, its complexity is as high as that
of exhaustive search.
Each level in the binary search tree corresponds to the locking decision for one memory
block in the set. We obtain a solution when a leaf node is reached or the entire cache set
is locked (line 5). The number of cache hits is computed based on the cache modeling
described in Section 6.2 and only the best solution (the lock list) is kept (lines 6-9). For
the current memory block m, we consider two decisions: locked and not locked.
For each decision, we evaluate the upper bound of cache hits and prune the search tree
if possible. At each level j (recursion depth), we use Curhit to represent the cache
hits from level 1 to j given the current lock list Lock. In other words, Curhit is the sum of
cache hits for the memory blocks we have considered so far. Thus, Curhit is computed as
$$
Curhit = \sum_{m \in M_i \setminus M} \{\mathit{num\_hit}(m) \mid Lock\}
$$
where Mi represents the set of memory blocks mapped to cache set i and M is the set of
remaining memory blocks to explore. Also, Mi\M represents the set of memory blocks we
have considered so far.

Algorithm 2: Optimal cache locking algorithm
 1  foreach set Ci in the cache do
 2      optimalSoln := ∅; Maxhit := {num_hit(Ti)|∅};
 3      search(Mi, ∅);
 4  Function search(M, Lock)
 5      if |Lock| = A or M = ∅ then
 6          Newhit := {num_hit(Ti)|Lock};
 7          if Newhit > Maxhit then
 8              Maxhit := Newhit;
 9              optimalSoln := Lock;
10          return;
11      Let m be any memory block from M;
12      Lock := Lock ∪ {m};
13      M := M\{m};
14      Curhit := 0;
15      foreach m′ ∈ Mi\M do Curhit := Curhit + {num_hit(m′)|Lock};
16      Bound := ComputeBound(M, Lock);
17      if Curhit + Bound > Maxhit then
18          search(M, Lock);    /* branching decision for m */
19      search(M, Lock\{m});

Let Lock′ be the set of memory blocks to be locked in cache set Ci when exploring
the remaining memory blocks, where Lock′ ⊆ M. Clearly, Lock ∩ Lock′ = ∅ and
|Lock ∪ Lock′| ≤ A.
Theorem 4. For memory block m ∈Mi\M ,
{num hit(m)|Lock} ≥ {num hit(m)|Lock ∪ Lock′}.
Proof. The proof follows directly from the inclusion property of the LRU replacement
policy. The inclusion property states that after any series of references, a smaller store
always contains a subset of the blocks in the larger store. After locking |Lock′| additional
memory blocks, the number of available cache blocks in set Ci reduces. Moreover,
m ∈ Mi\M and Lock′ ⊆ M, so m ∉ Lock′. Thus, on one hand, if a reference
of m was a cache miss under the lock list Lock, it will still be a cache miss under the lock
list Lock ∪ Lock′. On the other hand, if a reference of m was a cache hit under the
lock list Lock, it might be downgraded to a cache miss under the lock list Lock ∪ Lock′
because fewer cache lines are available.
In other words, for a memory block m we have considered so far (m ∈ Mi\M ),
{num hit(m)|Lock} is the upper bound of the cache hits of m when exploring the
remaining memory blocks. More importantly, Curhit is the upper bound of the total
number of cache hits for the memory blocks we have considered so far while exploring
the remaining memory blocks.
We define Bound as the maximum possible number of cache hits from the remaining
memory blocks with the existing lock list Lock. This bound is returned by function
ComputeBound. Obviously, if Curhit + Bound ≤ Maxhit, then the search space
rooted at the current node will be pruned, where Maxhit represents the maximum number
of cache hits over all the solutions we have reached. To compute Bound, there are two
cases. On one hand, if there are enough cache lines for the remaining memory blocks
(|M| ≤ A − |Lock|), then locking all the remaining blocks leads to all cache hits. Thus,
Bound is just the total number of accesses of the remaining memory blocks. On the other
hand, if there are more memory blocks than the available cache lines (|M| > A − |Lock|),
we need to derive the upper bound of cache hits (Bound) without knowing the memory
blocks to be locked. In the following, we will derive Bound for this case.
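Let M′ be the set of remaining memory blocks to be locked that leads to the
maximum number of cache hits for the remaining memory blocks, where M′ ⊆ M
and |M′| ≤ A − |Lock|. Let H denote this maximum number of cache hits:
$$
H = \sum_{m \in M'} access(m) + \sum_{m \in M \setminus M'} \{\mathit{num\_hit}(m) \mid Lock \cup M'\}
$$
Finding M′ is the exact problem we are trying to solve. During our branch and bound
search, we derive an upper bound for H without knowing M′. Following Theorem 4,
$$
H \le \sum_{m \in M'} access(m) + \sum_{m \in M \setminus M'} \{\mathit{num\_hit}(m) \mid Lock\}
$$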
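Let Mf ⊆ M be the set of A − |Lock| memory blocks with the maximum number of
accesses and Mh ⊆ M be the set of |M| − (A − |Lock|) memory blocks with the maximum
number of cache hits under the lock list Lock. So, |Mf| + |Mh| = |M|.
Theorem 5.
$$
\sum_{m \in M'} access(m) + \sum_{m \in M \setminus M'} \{\mathit{num\_hit}(m) \mid Lock\}
\le \sum_{m \in M_f} access(m) + \sum_{m \in M_h} \{\mathit{num\_hit}(m) \mid Lock\}
$$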
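Proof. For brevity, we write hit(m) for {num_hit(m)|Lock}. There are two cases.
• Mf ∩ M′ = ∅.
Let Mt be the set of |M′| memory blocks with the maximum number of accesses. Obviously,
Mt ⊆ Mf because |Mf| = A − |Lock| and |M′| ≤ A − |Lock|. We define
Mo = Mf\Mt. By the choice of Mt,
$$
\sum_{m \in M'} access(m) \le \sum_{m \in M_t} access(m)
$$
For any memory block m,
hit(m) ≤ access(m) (6.2)
Since Mo ⊆ Mf and Mf ∩ M′ = ∅, we have Mo ⊆ M\M′. Splitting the hits of M\M′ and
applying (6.2) to the blocks in Mo gives
$$
\sum_{m \in M \setminus M'} hit(m)
= \sum_{m \in M_o} hit(m) + \sum_{m \in M \setminus (M' \cup M_o)} hit(m)
\le \sum_{m \in M_o} access(m) + \sum_{m \in M \setminus (M' \cup M_o)} hit(m)
$$
The number of memory blocks in the last sum is
|M\(M′ ∪ Mo)| = |M| − |M′ ∪ Mo|
= |M| − |M′| − |Mo| (Mo ∩ M′ = ∅)
= |M| − |M′| − |Mf| + |Mt| (Mo = Mf\Mt)
= |M| − |Mf| (|Mt| = |M′|)
= |Mh|
(6.6)
As Mh contains the |Mh| memory blocks with the maximum number of hits, the hits of any
|Mh| memory blocks of M cannot exceed the hits of Mh. Combining the inequalities above
with Mf = Mt ∪ Mo,
$$
\sum_{m \in M'} access(m) + \sum_{m \in M \setminus M'} hit(m)
\le \sum_{m \in M_t} access(m) + \sum_{m \in M_o} access(m) + \sum_{m \in M_h} hit(m)
= \sum_{m \in M_f} access(m) + \sum_{m \in M_h} hit(m)
$$
• Mf ∩ M′ ≠ ∅.
Let Mf ∩ M′ = Mu. Then, the contributions of Mu on the two sides of the inequality in
Theorem 5 cancel each other. After that, the problem reduces to the case above.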
Following Theorem 5, Algorithm 3 presents the computation of Bound, which is
the maximal number of hits for the remaining memory blocks M given the current lock
list Lock. In other words, Bound is estimated by adding the accesses of the A − |Lock|
memory blocks with the maximum number of accesses (corresponding to locking the
remaining cache lines with the most profitable memory blocks) and the hits of the |M| − (A −
|Lock|) memory blocks with the maximum number of hits given the current lock list
(corresponding to the memory blocks that are not locked).
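The branch and bound search and its bound can be sketched for a single cache set as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this chapter: the names hits, compute_bound and optimal_lock are hypothetical, both branches are pruned with the same test, blocks are explored in sorted order, and the input profiles are those of Figure 6.1 for a 2-way cache.

```python
def hits(m, access, trp, lock, assoc):
    """Hits of block m under lock list `lock` (from its temporal reuse profile)."""
    if m in lock:
        return access[m]
    return sum(f for s, f in trp.get(m, {}).items() if len(s | lock) < assoc)


def compute_bound(remaining, access, trp, lock, assoc):
    """Upper bound on the hits obtainable from the unexplored blocks."""
    n_locked = min(assoc - len(lock), len(remaining))
    by_access = sorted((access[m] for m in remaining), reverse=True)
    by_hits = sorted((hits(m, access, trp, lock, assoc) for m in remaining),
                     reverse=True)
    # Best accesses for the free ways plus best hits for the other blocks.
    return sum(by_access[:n_locked]) + sum(by_hits[:len(remaining) - n_locked])


def optimal_lock(access, trp, assoc):
    """Return (lock list, hits) maximizing the hits of one cache set."""
    blocks = sorted(access)
    best = {"hits": sum(hits(m, access, trp, frozenset(), assoc) for m in blocks),
            "lock": frozenset()}

    def search(idx, lock):
        if len(lock) == assoc or idx == len(blocks):
            total = sum(hits(m, access, trp, lock, assoc) for m in blocks)
            if total > best["hits"]:
                best["hits"], best["lock"] = total, lock
            return
        remaining = blocks[idx + 1:]
        for choice in (lock | {blocks[idx]}, lock):    # lock the block, then skip it
            cur = sum(hits(m, access, trp, choice, assoc) for m in blocks[:idx + 1])
            if cur + compute_bound(remaining, access, trp, choice, assoc) > best["hits"]:
                search(idx + 1, choice)

    search(0, frozenset())
    return best["lock"], best["hits"]


access = {"m0": 3, "m1": 3, "m2": 3}
trp = {
    "m0": {frozenset(): 1},
    "m1": {frozenset({"m0"}): 1},
    "m2": {frozenset({"m0"}): 1, frozenset({"m1"}): 1},
}
best_lock, best_hits = optimal_lock(access, trp, assoc=2)
```

In this small example every pair of locked blocks yields the same hit count, so the lock list returned depends only on the exploration order.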
6.3.2 Heuristic Approach
Our heuristic is iterative in nature and exploits the modeling of cache locking described
in Section 6.2. As each cache set can be modeled independently, the iterative algorithm
is applied to each cache set separately. So given a cache set Ci, our goal is to determine
Locki such that num_hit(Ti) is maximized.
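Algorithm 3: Bound computation
Function ComputeBound(M, Lock)
    Bound := 0; M′ := ∅;
    while |M′| < A − |Lock| and M′ ≠ M do
        find m′ s.t. access(m′) = max_{m∈M\M′} access(m);
        Bound := Bound + access(m′);
        M′ := M′ ∪ {m′};
    M′′ := ∅;
    while |M′| + |M′′| < |M| do
        find m′ s.t. {num_hit(m′)|Lock} = max_{m∈M\M′′} {num_hit(m)|Lock};
        Bound := Bound + {num_hit(m′)|Lock};
        M′′ := M′′ ∪ {m′};
    return Bound;
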
Initially, we set Locki = ∅ and compute the number of cache hits in the original
cache
current_hit = {num_hit(Ti)|∅}
In each iteration, we go through all the unlocked memory blocks in the cache set m ∈
Mi\Locki and compute the number of cache hits if m were locked in the cache,
new_hit_m = {num_hit(Ti)|Locki ∪ {m}}
and the maximum benefit over all the candidates,
$$
benefit = \max_{m \in M_i \setminus Lock_i} (\mathit{new\_hit}_m - \mathit{current\_hit})
$$
If benefit ≤ 0, then locking any of the remaining memory blocks would not improve the
memory performance and we terminate our iterative algorithm. Otherwise, we
lock the memory block m with the maximum benefit, i.e., benefit = new_hit_m −
current_hit. We break ties arbitrarily. The algorithm also terminates when |Locki| =
A, i.e., we have locked all the blocks in the cache set. Our cache locking algorithm is
detailed in Algorithm 4.
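The iterative heuristic can be sketched for one cache set as shown below. The function names are assumptions, ties are broken by sorted block order (the text allows any tie-breaking rule), and the profiles are those of Figure 6.1 for a 2-way cache.

```python
def set_hits(access, trp, lock, assoc):
    """Total hits of one cache set under the lock list `lock`."""
    total = 0
    for m in access:
        if m in lock:
            total += access[m]
        else:
            total += sum(f for s, f in trp.get(m, {}).items()
                         if len(s | lock) < assoc)
    return total


def greedy_lock(access, trp, assoc):
    """Lock one block per iteration while doing so strictly increases the hits."""
    lock = frozenset()
    current_hit = set_hits(access, trp, lock, assoc)
    while len(lock) < assoc:
        benefit, selected = 0, None
        for m in sorted(set(access) - lock):
            gain = set_hits(access, trp, lock | {m}, assoc) - current_hit
            if gain > benefit:
                benefit, selected = gain, m
        if selected is None:          # no candidate improves the hit count
            break
        lock = lock | {selected}
        current_hit += benefit        # hits under the enlarged lock list
    return lock, current_hit


access = {"m0": 3, "m1": 3, "m2": 3}
trp = {
    "m0": {frozenset(): 1},
    "m1": {frozenset({"m0"}): 1},
    "m2": {frozenset({"m0"}): 1, frozenset({"m1"}): 1},
}
lock, hits_after = greedy_lock(access, trp, assoc=2)
```

Each candidate is evaluated from the profiles alone, so no trace simulation is needed inside the loop.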
6.4 Experimental Evaluation
Experimental Setup. We select benchmarks from MiBench and MediaBench for
evaluation purposes. The benchmarks and their characteristics are shown in Table 6.1.
We conduct our experiments using the SimpleScalar framework [14]. We evaluate the
effectiveness of our algorithms with different cache parameters. We vary the cache size
(2KB, 4KB, 8KB, 16KB) and associativity (1, 2, 4, 8), but keep the block size constant
(32 bytes). The extra code to fetch and lock memory blocks is inserted at the end of
the program as a trampoline. Thus, it does not affect the original program layout. The cache
hit latency is assumed to be 1 cycle and the cache miss penalty is assumed to be 100 cycles.
We perform all the experiments on a 3GHz Pentium 4 CPU with 2GB memory.
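For reference, the number of cache sets implied by each evaluated configuration follows from cache size = K×A×L. The short sketch below enumerates the configurations named above; the variable names are illustrative only.

```python
LINE_SIZE = 32  # bytes, constant across all configurations

# (cache size in KB, associativity A, number of sets K)
configs = [
    (size_kb, assoc, size_kb * 1024 // (assoc * LINE_SIZE))
    for size_kb in (2, 4, 8, 16)
    for assoc in (1, 2, 4, 8)
]

for size_kb, assoc, num_sets in configs:
    print(f"{size_kb:2d}KB, {assoc}-way: {num_sets} sets")
```

For a fixed cache size, doubling the associativity halves the number of sets, so more memory blocks compete within each set.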
Algorithm 4: Heuristic cache locking algorithm
foreach set Ci in the cache do
    Locki := ∅; flag := TRUE;
    current_hit := {num_hit(Ti)|Locki};
    while flag and |Locki| < A do
        benefit := 0;
        foreach m ∈ Mi\Locki do
            new_hit_m := {num_hit(Ti)|Locki ∪ {m}};
            if (new_hit_m − current_hit) > benefit then
                benefit := new_hit_m − current_hit;
                selected_block := m;
        if benefit > 0 then
            Locki := Locki ∪ {selected_block};
            current_hit := current_hit + benefit;
        else
            flag := FALSE;

Benchmark      Trace (MB)   Average TRP (KB)   Runtime (sec)
                                               Our-Heuristic   Anand-Barua   Speedup
Adpcm          220          4.68               11.62           2178          187
Sha            103          6.63               5.412           1092          201
Rijndael       147          12.99              8.31            2289          275
Blowfish       318          10.41              17.38           3558          204
Dijkstra       293          10.31              15.93           3120          195
Bitcnts        170          3.32               9.37            1458          155
Basicmath      1400         34.38              80.98           21312         263
Qsort          360          12.13              20.54           3970          193
Susan          220          10.26              11.79           2561          217
Stringsearch   39           10.41              2.16            425           196
FFT            790          25.88              47.66           11418         239
Jpeg           235          50.19              14.43           2505          173
Lame           965          117.56             73.96           12979         175
Gsm            297          22.38              16.78           4766          284
Mpeg2dec       222          44.13              12.64           2668          211
Table 6.1: Characteristics of benchmarks.

We first generate the instruction trace of each benchmark using sim-profile, a functional
simulator. Given the address trace and the cache configuration, we can easily
create the temporal reuse profile (TRP). Both the trace size and TRP size are shown in
Table 6.1. Note that the TRP size depends on the cache configuration. The table shows
the average TRP size across all evaluated cache configurations. The temporal reuse
profile is significantly more compact than the address trace (KB vs. MB).
The TRP sizes across various cache configurations are shown in Figure 6.2. For
every cache configuration, the TRP remains far smaller than the address trace. The TRP
size depends on the cache configuration because the cache hits vary across configurations
and the TRP only records cache hits (only cache hits are affected by locking). More
importantly, for different cache configurations, the temporal conflict sets may be different.
From Figure 6.2, we observe that for caches of the same size, the TRP size generally
increases with associativity. For the same cache size, when the associativity
(the number of cache ways) increases, the number of cache sets reduces to keep the
cache size constant. Thus, more memory blocks are mapped to each cache set
in a highly associative cache. Memory blocks mapped to the same cache set conflict
with each other. Hence, the temporal conflict sets are more complex for a highly
associative cache. This explains why the TRP size increases with associativity.
We also observe that for caches with the same associativity, the TRP size generally
increases slightly with the cache size. This may be due to the fact that the
number of cache hits increases when the cache size increases and the TRP only records
cache hits. However, when the cache size is sufficiently large for the temporal locality,
the TRP size will decrease (e.g., lame, jpeg). Though the number of cache hits increases,
the temporal conflict sets (the conflicting memory blocks in between) become simpler when
the cache size increases. As a result, the TRP size decreases.
Figure 6.2: TRP size across different cache configurations.
We propose two algorithms for cache locking: an optimal solution and a heuristic
approach. We present the results for the heuristic first, followed by a comparison of the
two algorithms.
Miss Rate Improvement. The instruction cache miss rate improvement with locking
(heuristic) over a cache without locking is shown in Figure 6.3 for different cache sizes
(2KB, 4KB, 8KB and 16KB). For each cache size, we vary the associativity from 1 to 8.
For any cache size, the miss rate improvement for a direct mapped cache is minimal. This
is expected as only one block is available per cache set and locking that block implies
a miss for all the remaining memory blocks mapped to the cache set. However, for set
associative caches, our locking algorithm achieves significant performance improvement
across all the benchmarks. On average, we obtain 14% improvement for the 2KB cache,
20% for the 4KB cache, 24% for the 8KB cache, and 20% for the 16KB cache. The average
miss rate improvement reduces from the 8KB cache to the 16KB cache. This is because for
most benchmarks a 16KB cache is sufficiently large. Thus, the cache miss rate is already
small and there is little opportunity for improvement.
Figure 6.3: Miss rate improvement (percentage) over cache without locking for various cache
configurations.
Execution Time Improvement. Figure 6.4 shows the execution time improvement
for different cache sizes (2KB, 4KB, 8KB and 16KB). For each cache size, we vary
the associativity from 1 to 8. Some benchmarks do not gain considerable execution
time improvement even though the cache miss rate is improved. This is because for these
benchmarks the absolute number of cache misses without cache locking is very small. Thus,
the improvement in cache miss rate does not contribute much to the overall execution time
reduction. On average, we obtain 10% improvement for the 2KB cache, 12% for the 4KB
cache, 10% for the 8KB cache, and 7% for the 16KB cache.
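The effect of the absolute miss count can be illustrated with a simple fetch-cycle model built from the stated latencies (1 cycle per hit, 100 cycles of miss penalty). The model below is a simplifying assumption that counts instruction fetch cycles only, and the access and miss counts are hypothetical.

```python
HIT_LATENCY = 1      # cycles per instruction cache access
MISS_PENALTY = 100   # additional cycles per instruction cache miss


def fetch_cycles(accesses, misses):
    """Instruction fetch cycles under the simple latency model."""
    return accesses * HIT_LATENCY + misses * MISS_PENALTY


def improvement(before, after):
    """Relative reduction from `before` to `after`."""
    return (before - after) / before


accesses = 1_000_000
# Both cases remove 20% of the misses, but from very different baselines.
few_misses = improvement(fetch_cycles(accesses, 1_000), fetch_cycles(accesses, 800))
many_misses = improvement(fetch_cycles(accesses, 50_000), fetch_cycles(accesses, 40_000))
```

With only 1,000 misses the fetch cycles drop by less than 2%, whereas the same relative miss reduction from 50,000 misses saves about 17%.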
Energy Consumption Improvement. Figure 6.5 shows the memory hierarchy energy
consumption improvement for different cache sizes (2KB, 4KB, 8KB and 16KB).
For each cache size, we vary the associativity from 1 to 8. The energy consumed per
cache access differs across cache configurations. We model the energy
consumption of different cache configurations using the CACTI [89] model for 0.13µm
technology. In this work, our focus is dynamic energy consumption. The energy
consumption of one access to memory is assumed to be 200 times the energy consumption
of a cache hit [106]. On average, we obtain 11% improvement for the 2KB cache, 13%
for the 4KB cache, 12% for the 8KB cache, and 9% for the 16KB cache.
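The dynamic energy model can be written down in the same spirit. In the sketch below, the per-access cache energy e_hit is a placeholder (in this work it comes from CACTI for each configuration), each miss is assumed to cost exactly one memory access at 200 times e_hit, and the access and miss counts are hypothetical.

```python
MEMORY_FACTOR = 200  # one memory access costs 200 times a cache hit


def dynamic_energy(accesses, misses, e_hit):
    """Dynamic energy of the instruction memory hierarchy, in units of e_hit."""
    return e_hit * (accesses + MEMORY_FACTOR * misses)


def improvement(before, after):
    """Relative reduction from `before` to `after`."""
    return (before - after) / before


# Hypothetical counts: locking removes 12% of 10,000 misses.
before = dynamic_energy(1_000_000, 10_000, e_hit=0.5)
after = dynamic_energy(1_000_000, 8_800, e_hit=0.5)
saving = improvement(before, after)
```

Within one cache configuration the relative saving does not depend on e_hit, since it scales both terms equally.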
Figure 6.4: Execution time improvement (percentage) over cache without locking for various
cache configurations. (Panels per cache size, including 4K Cache and 16K Cache; series per
associativity, including 1-way and 2-way.)
Figure 6.5: Energy consumption improvement (percentage) over cache without locking for
various cache configurations.
Figure 6.6: Cache miss rate improvement comparison of heuristic and optimal algorithm for
2-way set associative cache.
Heuristic versus Optimal. The cache miss improvement comparison of the heuristic and
the optimal solution for 2-way set associative caches is shown in Figure 6.6. The cache miss
improvement is an average value across all the cache sizes. As shown, the heuristic
returns close to optimal solutions. For 2-way set associative caches, our heuristic improves
the instruction cache miss rate by 15.6% on average, while the optimal solution
improves it by 15.8%. For direct mapped caches, both the heuristic and the optimal solution
achieve marginal improvement. For 4-way associative caches, the heuristic achieves
21.1% miss rate improvement on average, while the optimal solution returns 21.4%
improvement. As for the runtime of the algorithms, the heuristic is 1 to 273 times faster than
the optimal algorithm for low associativity caches (≤ 4). For 8-way associative caches, the
heuristic returns solutions quite fast (Table 6.1), but the optimal algorithm fails to terminate
within 10 hours for some big benchmarks.
Comparison with Anand-Barua Method. We compare our heuristic with the Anand-Barua
method [12], the only other approach in the literature targeting cache locking
for average-case program performance improvement. Their proposal is an iterative
simulation-based heuristic and needs feedback from a trace driven simulator in each
iteration. We implement their algorithm and compare it against our heuristic both in terms
of performance (cache miss rate improvement) and efficiency (algorithm runtime).
Figure 6.7: Average cache miss rate improvement comparison.
In terms of cache miss rate improvement, our approach performs better than or
at least equal to Anand-Barua's method for every cache configuration. Figure 6.7
shows the average instruction cache miss rate improvement for 2KB, 4KB, 8KB
and 16KB cache sizes. The miss rate improvement shown in Figure 6.7 is the average
across all set-associative caches (2, 4 and 8 way) as neither method gains much
for direct mapped caches. As evident from Figure 6.7, our heuristic achieves higher
cache miss rate improvement than Anand-Barua's method across all the benchmarks
and cache configurations. For the benchmarks blowfish and stringsearch, the improvement
over Anand-Barua's method is more than 20% for some configurations. This
is because our cache modeling is accurate whereas cache hits and misses are approximated in
Anand-Barua's work. On average, our approach obtains 14% improvement while
Anand-Barua's method achieves 12% for the 2KB cache; our approach obtains
20% while Anand-Barua's method achieves 14% for the 4KB cache; our approach obtains
24% while Anand-Barua's method achieves 17% for the 8KB cache; and our approach
obtains 20% while Anand-Barua's method achieves 10% for the 16KB cache.
We notice that the miss rate improvements returned by our approach and
Anand-Barua's method are close for the small cache (2KB). This is probably because there
are more conflicting memory blocks per cache set in a small cache. Thus, the negative
impact of locking a memory block may be large under both our precise modeling and
the Anand-Barua approximation. However, for larger caches, our
approach is much better than Anand-Barua's method.
The Anand-Barua method invokes cache simulation in each iteration. However, cache
simulation can be very slow for large traces. In addition, the number of simulations
required grows linearly with the total number of locked memory blocks. When the
number of memory blocks to be locked is not small, a simulation based approach
may not be feasible. In contrast, we only need one round of profiling and the subsequent
analysis relies only on the compact profiles. The runtime comparison of our heuristic
and Anand-Barua's method is detailed in Table 6.1 under the column Runtime. The time
presented is the average runtime across 12 cache configurations. Our approach is 155 to
284 times faster than Anand-Barua's method.
Impact of Replacement Policy. Our TRP based cache modeling is developed under
the assumption that the replacement policy is least recently used (LRU). Berg and Hagersten
observed that the choice of replacement policy has little effect on the miss ratio
for most applications, although small differences exist [20]. We therefore also evaluate our
technique under the First In First Out (FIFO) replacement policy.
For the FIFO replacement policy, we first collect the cache misses without cache
locking. Then, we use our technique, which employs TRP cache modeling, to select
the memory blocks to lock for each cache set. Finally, for the modified program (with
locking), we collect the cache misses using the FIFO replacement policy again.
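The before and after miss counts for FIFO can be obtained with a small per-set simulator such as the sketch below. The function name fifo_misses is an assumption, locked blocks are treated as preloaded (so they never miss), and T is the same hypothetical trace used to illustrate Figure 6.1, with the lock list chosen by the LRU-based analysis.

```python
from collections import deque


def fifo_misses(set_trace, assoc, lock=frozenset()):
    """Count misses of one cache set under FIFO replacement with pinned blocks."""
    free_ways = assoc - len(lock)
    queue = deque()                  # unlocked resident blocks, oldest first
    misses = 0
    for m in set_trace:
        if m in lock or m in queue:
            continue                 # hit; FIFO order is not updated on a hit
        misses += 1
        if free_ways == 0:
            continue                 # no unlocked way left to hold the block
        if len(queue) == free_ways:
            queue.popleft()          # evict the block that entered first
        queue.append(m)
    return misses


T = ["m2", "m1", "m2", "m0", "m0", "m2", "m1", "m0", "m1"]
misses_before = fifo_misses(T, assoc=2)
misses_after = fifo_misses(T, assoc=2, lock=frozenset({"m0", "m1"}))
```

Because the lock list is derived from an LRU model, the benefit under FIFO is measured rather than guaranteed, which is why the misses are collected again after locking.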
The instruction cache miss rate improvement with locking (heuristic) over a cache
without locking for the FIFO replacement policy is shown in Figure 6.8 for different cache
sizes (2KB, 4KB, 8KB and 16KB). For each cache size, we vary the associativity from
1 to 8. We observe that our cache locking technique is still quite effective for the FIFO
replacement policy. On average, we obtain 11% improvement for the 2KB cache, 15%
for the 4KB cache, 19% for the 8KB cache, and 16% for the 16KB cache.
Code Memory Layout. The performance of the cache locking algorithm critically
depends on the code memory layout. In the discussion so far, we have assumed that
we start with the "natural" code layout. However, instruction cache performance can
be improved significantly through procedure placement, i.e., reordering procedures so
that cache conflicts are reduced [40]. Clearly, procedure placement and cache locking
are complementary approaches. In the following, we evaluate the effects of these two
techniques on cache performance. For procedure placement, we choose TPCM [40],
a state of the art procedure placement technique. In TPCM, the conflicts among
procedures are modelled using the procedure temporal relationship (i.e., which procedures
are referenced between two consecutive accesses to another procedure). Then, the
procedure temporal relationship is used along with the cache configuration and procedure
sizes by TPCM to estimate the cost/benefit of procedure placement and determine procedure
locations. TPCM does not guarantee improvement of instruction cache performance after
procedure reordering. In contrast, our cache locking algorithm, by design, is guaranteed
to either improve the performance or keep it the same. Moreover, procedure placement
techniques are effective for applications with a substantial number of procedures such as
Jpeg, Lame, and Mpeg2dec.
Figure 6.8: Miss rate improvement (percentage) over cache without locking for various cache
configurations for FIFO replacement policy.
Figure 6.9: Procedure placement (TPCM) vs Cache locking. Cache size is 8K. (Horizontal
axis: associativity, 1way to 8way.)
Figure 6.9 shows the comparison of TPCM with cache locking in terms of cache
miss rate improvement. We note that procedure placement performs very well for direct
mapped caches, while cache locking achieves very small or no improvement at
all. In general, procedure placement is a good choice for low associativity caches (1
or 2), while locking is more suitable for higher associativity caches (2, 4 and 8).
Procedure placement may not be a good choice for higher associativity caches for the
following reasons. First, higher associativity leads to fewer cache sets, leaving little
opportunity for procedure reordering. In contrast, higher associativity provides more
opportunity for cache locking. Moreover, procedure placement may incur performance
loss (see Mpeg2dec) due to coarse grained performance modeling, while our cache
locking heuristic is guaranteed not to degrade the performance.
Clearly, procedure placement and locking are complementary approaches and the
cache performance can benefit significantly from a combination of these two approaches.
To validate this hypothesis, we first perform procedure placement for each
benchmark. If the newly generated layout degrades the performance, we discard the
layout and revert to the original layout for cache locking. Otherwise, we perform
cache locking based on the new layout. Figure 6.9 shows the miss rate improvement
using this combined strategy. As evident from the figure, layout combined with locking
can achieve significant improvement for some benchmarks.
6.5 Summary
In this chapter, we propose two cache locking algorithms, an optimal algorithm and
a heuristic approach, to improve the average-case instruction cache performance. We
introduce temporal reuse profiles (TRP) to model the cost and benefit of cache locking
precisely and efficiently and exploit TRP in both algorithms. Experimental results
indicate that our heuristic can improve the cache miss rate by as much as 24% on average
and achieves close to the optimal results. In addition, compared to the state of the art
approach, our heuristic is better both in terms of performance and efficiency.
Chapter 7
Procedure Placement
In this chapter, we consider procedure placement to further optimize the performance
of instruction cache. Procedure placement is a popular technique that aims to improve
instruction cache hit rate by reducing conflicts in the cache through compile/link time
reordering of procedures. However, existing procedure placement techniques make
reordering decisions based on imprecise conflict information. This imprecision leads to
limited and sometimes negative performance gain, especially for set-associative caches.
We introduce intermediate blocks profile (IBP) to accurately but compactly model cost-
benefit of procedure placement for both direct mapped and set associative caches. We
propose an efficient algorithm that exploits IBP to place procedures in memory such
that cache conflicts are minimized. Furthermore, we observe that the code layout for a
specific cache configuration is not portable across different cache configurations. We
propose another algorithm that exploits IBP to generate a neutral code layout such that
the average cache miss rate across a set of cache configurations is minimized.
7.1 Introduction
Profile-based procedure placement is a well-known instruction cache optimization
technique that aims to reorder the procedures at compile/link time such that cache
conflict misses are eliminated at run time. State-of-the-art procedure placement
techniques [40, 45] generate a specific code layout for a particular cache
configuration. All these techniques require cache design parameters such as
cache line size and total cache size as inputs. This is because the solutions are created by
reasoning about where the procedures should be placed in the cache, which inevitably
requires knowledge of the line size and cache size. Being aware of the underlying cache
parameters, these techniques are shown to be better than the earlier works that neglect
them [53, 79].
However, state-of-the-art procedure placement techniques suffer from some drawbacks.
First, the cost and benefit of placing a procedure are modelled approximately.
The conflict metric (the approximation of cache misses) is defined at the granularity of
procedures. However, inside a procedure, there might be more than one program path (i.e.,
a sequence of instructions), and the conflict metric of different paths in the same
procedure may not be the same. Thus, with the existing techniques, it is possible that the
new code layout is worse than the original code layout due to this imprecise conflict
information. Second, the techniques are mainly designed for direct mapped caches
and do not model set associative caches accurately. For these two reasons,
the existing procedure placement techniques are not very effective for set associative
caches. To solve these two problems, we introduce the intermediate blocks profile (IBP) to
precisely model the cost and benefit of procedure placement. IBP is significantly more
compact than memory traces, so cache performance evaluation using IBP
is much more efficient than cache simulation. Based on the precise cache model using
IBP, our procedure placement algorithm starts from the original procedure ordering and
iteratively selects the most beneficial procedures along with their placements.
Moreover, we observe that the code layout generated for a specific cache configuration
by utilizing its parameters (cache size, associativity) may not be portable across
platforms with varying cache configurations. This problem exists for all procedure
placement techniques that rely on cache parameters [40, 45, 50, 39, 17]. This portability
issue is very important in situations where the underlying hardware platform (cache
configuration) is unknown. This is common for embedded systems where the code becomes
available during deployment through either download or portable media [72]. In
such situations, the compiler/linker may not know the underlying cache configuration and
thus is unable to generate a code layout appropriate for it. More importantly, the cache
configurations across platforms could be different
due to different versions of the processor or technology evolution. Thus, the same
executable (code layout) may have to run on systems with different cache configurations.
For the portability problem, we concern ourselves with the scenario where an application
can be run on platforms with the same instruction set architecture but different cache
configurations. To overcome the portability problem across platforms with varying
cache configurations, we propose a procedure placement algorithm that generates a
“neutral” code layout by using IBP and structural relations among different cache
configurations. The neutral code layout performs well on average across a set of cache
configurations.
7.2 Procedure Placement Problem
In this section, we first introduce the cache terminology and then formally define the
procedure placement problem.
Cache Terminology. Recall that we use L to represent the block (line) size, N to
represent the number of sets, and A to represent the associativity. The cache size is then
N × A × L. In this work, we consider the Least Recently Used (LRU) replacement policy,
where the block replaced is the one that has been unused for the longest time.
Given a memory address m, its corresponding memory block and the cache set to which
the memory block is mapped are
\[ \mathit{memory\_block}(m) = \lfloor m / L \rfloor \]
\[ \mathit{cache\_set}(m) = \lfloor m / L \rfloor \bmod N \]
Thus, a given memory address or memory block is mapped to exactly one cache set.
Figure 7.1 shows an example of memory address mapping. In the example, the line size is
two bytes (the last address bit). Memory address m1 is mapped to set 2 in a 4-set
cache, but to set 0 in a 2-set cache.
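The two mapping functions translate directly into code. The following is a minimal
sketch; the concrete address used for m1 is an assumption chosen to reproduce the
behaviour described for Figure 7.1, not a value taken from the figure.

```python
def memory_block(m, line_size):
    """Index of the memory block (line) containing byte address m."""
    return m // line_size


def cache_set(m, line_size, num_sets):
    """Cache set to which byte address m is mapped."""
    return memory_block(m, line_size) % num_sets


# Hypothetical byte address: with 2-byte lines it lies in memory block 2.
m1 = 0b0100
print(cache_set(m1, line_size=2, num_sets=4))  # 2
print(cache_set(m1, line_size=2, num_sets=2))  # 0
```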
Figure 7.1: Memory address mapping. The address is a byte address and the line size is
assumed to be 2 bytes (last bit).
Given a procedure p, its starting memory line can be defined as K × N + s, where
0 ≤ s < N and K ≥ 0. For procedure p, s is its starting cache set number when
mapped to the cache, and s affects the cache conflicts between p and other procedures; K
determines its location in memory, and K affects the memory size but not the cache
conflicts. Thus, procedure placement involves two phases: cache placement
and memory placement. The cache placement phase determines s for each procedure to
minimize conflicts; the memory placement phase determines K for each procedure and aims to
minimize the code size. In this work, we propose a new procedure placement algorithm
using IBP for the cache placement phase. As for memory placement, Guillon et al. [45]
provide an optimal solution for the memory placement problem, and we employ their
technique to minimize the code size.
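The decomposition of a starting memory line into K and s is simply an integer division
with remainder. A small sketch follows; the function names are illustrative only.

```python
def split_start_line(start_line, num_sets):
    """Return (K, s) such that start_line == K * num_sets + s and 0 <= s < num_sets."""
    return divmod(start_line, num_sets)


def start_line(K, s, num_sets):
    """Inverse of split_start_line: memory line chosen by memory placement."""
    return K * num_sets + s


# Two different memory locations with the same s conflict identically in the cache.
print(split_start_line(9, 4))   # (2, 1)
print(split_start_line(13, 4))  # (3, 1)
```

This is why the two phases can be separated: cache placement fixes s, and memory
placement is then free to choose any K that keeps procedures from overlapping.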
7.3 Intermediate Blocks Profile
Let P be the set of procedures of the program. Given a procedure p ∈ P, we use p_start to
denote its starting address in the original code layout, p_set to denote its starting cache
set number (0 ≤ p_set < N) and p_size to denote its size in bytes. For a procedure p ∈ P,
p_set may be changed by procedure placement to improve the cache performance. However,
procedure placement reorders instructions at the granularity of procedures. Thus, the
instructions inside a procedure remain contiguous even if the procedure’s location is
changed, and the offset of an instruction relative to the starting address of its
procedure never changes during procedure placement.
Definition 14 (Procedure Block). Given a memory address m, its procedure block is
defined as a tuple 〈p, l〉 where m belongs to procedure p and
\[ l = \left\lfloor \frac{m - p_{start}}{L} \right\rfloor . \]
Now, let T be the memory trace (sequence of memory references) generated by
executing a program on the target architecture. This trace is generated using the original
code layout. We transform the memory trace T to its corresponding procedure block
trace T′ by representing each memory reference in T using its corresponding procedure
block. The trace T′ remains the same during procedure placement, whereas trace T does not.
Furthermore, caches with different sizes and associativities, but with the same line size,
share the same procedure block trace.
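The transformation from T to T′ can be sketched as below. The data layout (a
dictionary from procedure name to start address and size) and the example addresses
are assumptions made for illustration.

```python
from bisect import bisect_right


def procedure_block_trace(trace, procedures, line_size):
    """Map each byte address in trace to its procedure block (name, l).

    procedures: dict name -> (start_address, size_in_bytes), non-overlapping.
    """
    ordered = sorted((start, size, name) for name, (start, size) in procedures.items())
    starts = [start for start, _, _ in ordered]
    result = []
    for m in trace:
        idx = bisect_right(starts, m) - 1
        if idx < 0:
            raise ValueError(f"address {m} lies before every procedure")
        start, size, name = ordered[idx]
        if m >= start + size:
            raise ValueError(f"address {m} lies outside every procedure")
        result.append((name, (m - start) // line_size))
    return result


# Moving P1 changes the address trace but not the procedure block trace.
before = procedure_block_trace([0, 1, 2, 3, 4, 0, 2], {"P0": (0, 2), "P1": (2, 3)}, 1)
after = procedure_block_trace([0, 1, 8, 9, 10, 0, 8], {"P0": (0, 2), "P1": (8, 3)}, 1)
print(before == after)  # True
```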
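In the following, we present our techniques based on the procedure block trace
T′. Let B be the set of procedure blocks of the program. Given a procedure block
b ∈ B, let us define the jth reference of b in the trace T′ as b[j].
Definition 15. Given two procedure blocks b = 〈p, l〉 and b′ = 〈p′, l′〉, conflict(b, b′) is
defined as
\[ \mathit{conflict}(b, b') = \begin{cases} 1 & \text{if } \exists k \in \mathbb{Z} \text{ s.t. } l + p_{set} - l' - p'_{set} = k \times N \\ 0 & \text{otherwise} \end{cases} \]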
In other words, conflict(b, b′) returns 1 if b and b′ are mapped to the same cache
set; otherwise, it returns 0.
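A direct transcription of the conflict test is given below as a sketch. Procedure
blocks are represented as (procedure, offset) tuples and the starting sets are kept
in a dictionary; both representations are assumptions of this example.

```python
def conflict(b, b_other, p_set, num_sets):
    """Return 1 if the two procedure blocks map to the same cache set, else 0.

    b, b_other: (procedure, block offset) tuples.
    p_set: dict procedure -> starting cache set number.
    """
    (p, l), (q, k) = b, b_other
    return int((l + p_set[p] - k - p_set[q]) % num_sets == 0)


# Two cache sets, both procedures starting at set 0.
p_set = {"P0": 0, "P1": 0}
print(conflict(("P1", 0), ("P1", 2), p_set, 2))  # 1: line gap of 2 with 2 sets
print(conflict(("P0", 0), ("P0", 1), p_set, 2))  # 0: adjacent lines
```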
Definition 16 (Procedure Block Interval). A procedure block interval is a tuple 〈p, s, e〉.
It represents a sequence of contiguous procedure blocks which belong to the same pro-
cedure {〈p, l〉 : s ≤ l ≤ e}.
Given two procedure block intervals which belong to the same procedure 〈p, s1, e1〉
and 〈p, s2, e2〉, they can be merged to a bigger procedure block interval 〈p, s1, e2〉 if
s2 = e1 + 1 or 〈p, s2, e1〉 if s1 = e2 + 1.
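Repeated merging amounts to compressing a set of procedure blocks into maximal runs
of consecutive offsets. A minimal sketch, assuming blocks are (procedure, offset)
tuples:

```python
def to_intervals(blocks):
    """Compress (procedure, offset) blocks into sorted (p, s, e) intervals."""
    intervals = []
    for p, l in sorted(blocks):
        if intervals and intervals[-1][0] == p and intervals[-1][2] + 1 == l:
            # The block directly follows the current interval: extend it.
            intervals[-1] = (p, intervals[-1][1], l)
        else:
            intervals.append((p, l, l))
    return intervals


print(to_intervals({("P1", 0), ("P1", 1), ("P1", 2), ("P0", 0)}))
# [('P0', 0, 0), ('P1', 0, 2)]
```

The interval form is what keeps the profile compact: a long straight-line run inside
one procedure costs a single tuple rather than one entry per block.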
Definition 17 (Intermediate Blocks Set (IBS)). Given a procedure block reference
b[j] (j > 1) in the trace where b ∈ B, let S_between be the set of unique procedure blocks
referenced between b[j − 1] and b[j] in T′. If there is no such reference, then
S_between = ∅. Let procedure block b be 〈p, l〉 and S_other be
{〈p′, l′〉 : 〈p′, l′〉 ∈ S_between ∧ p′ ≠ p}.
Then, the intermediate blocks set (IBS) of procedure block reference b[j], IBS_b[j], is
defined as a tuple 〈S, C〉, where S is S_other represented as a set of procedure block
intervals and
\[ C = \sum_{\langle p, l' \rangle \in S_{between}} \mathit{conflict}(b, \langle p, l' \rangle) . \]
More clearly, IBS_b[j] has two parts. The first part is the set of unique procedure
blocks from other procedures referenced between b[j − 1] and b[j] (S_other), in procedure
block interval format. The second part is the number of conflicts encountered from the
procedure blocks in between that belong to procedure p itself. Given two procedure
blocks belonging to the same procedure, their conflict (defined by Definition 15) is
not affected by procedure placement because their relative offset is not changed by
procedure placement.

Procedure   Start address (byte)   Size (bytes)
P0          0000                   2
P1          0010                   3

Procedure block   Intermediate blocks profile
〈P0, 0〉           {〈s, f(s) = 1〉}, where s = 〈{〈P1, 0, 2〉}, 0〉
〈P1, 0〉           {〈s, f(s) = 1〉}, where s = 〈{〈P0, 0, 0〉}, 1〉

Figure 7.2: Procedure block trace and intermediate blocks profile. Block (line) size is assumed
to be 1 byte. The number of cache sets is assumed to be 2.
Different references of the same procedure block may have the same intermediate
blocks set. More importantly, an intermediate blocks set 〈S, C〉 that does not
interact with other procedures (i.e., S = ∅) is not affected by procedure placement.
Let
\[ IBS_b = \{ \langle S, C \rangle : \exists j > 1 \text{ s.t. } \langle S, C \rangle = IBS_{b[j]} \wedge S \neq \emptyset \} \]
Definition 18 (Intermediate Blocks Profile). The intermediate blocks profile IBP_b of
a procedure block b is defined as a set of 2-tuples {〈s, f(s)〉} where s ∈ IBS_b and f(s)
denotes the frequency of the intermediate blocks set s of procedure block b in the trace.
In Figure 7.2, we show an example of a procedure block trace and its corresponding
intermediate blocks profile. More concretely, for the second reference of procedure
block 〈P0, 0〉, the intermediate blocks set is 〈{〈P1, 0, 2〉}, 0〉 because three procedure
blocks 〈P1, 0〉, 〈P1, 1〉 and 〈P1, 2〉 from P1 are accessed in between, and no procedure
block from P0 that conflicts with 〈P0, 0〉 is accessed in between (i.e., 〈P0, 1〉 does
not conflict with 〈P0, 0〉). For the second reference of procedure block 〈P1, 0〉, the
intermediate blocks set is 〈{〈P0, 0, 0〉}, 1〉 because one procedure block 〈P0, 0〉 from P0
is accessed in between, and one procedure block from P1 (〈P1, 2〉) that conflicts
with 〈P1, 0〉 is accessed in between. 〈P1, 0〉 conflicts with 〈P1, 2〉 because the line gap
between them is 2 and the number of cache sets is 2 in the example.
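The construction of the profile from a procedure block trace can be sketched as
follows. The trace used in the example is an assumption: it is one ordering that is
consistent with the procedure attributes and the two intermediate blocks sets
discussed for Figure 7.2. The quadratic rescan of the trace is kept for clarity and
is not meant to reflect an efficient implementation.

```python
from collections import Counter, defaultdict


def build_ibp(block_trace, p_set, num_sets):
    """Return {block: Counter({(intervals, C): frequency})} for a block trace."""

    def conflicts(b, other):
        (p, l), (q, k) = b, other
        return (l + p_set[p] - k - p_set[q]) % num_sets == 0

    def intervals(blocks):
        out = []
        for p, l in sorted(blocks):
            if out and out[-1][0] == p and out[-1][2] + 1 == l:
                out[-1] = (p, out[-1][1], l)
            else:
                out.append((p, l, l))
        return tuple(out)

    last_seen = {}
    ibp = defaultdict(Counter)
    for pos, b in enumerate(block_trace):
        if b in last_seen:
            between = set(block_trace[last_seen[b] + 1:pos])
            others = {x for x in between if x[0] != b[0]}
            own = sum(1 for x in between if x[0] == b[0] and conflicts(b, x))
            if others:  # IBS_b keeps only sets with S non-empty
                ibp[b][(intervals(others), own)] += 1
        last_seen[b] = pos
    return ibp


# Assumed trace; block size 1 byte, 2 cache sets, P0 and P1 both start at set 0.
trace = [("P0", 0), ("P0", 1), ("P1", 0), ("P1", 1), ("P1", 2), ("P0", 0), ("P1", 0)]
ibp = build_ibp(trace, {"P0": 0, "P1": 0}, 2)
print(dict(ibp[("P0", 0)]))  # {((('P1', 0, 2),), 0): 1}
print(dict(ibp[("P1", 0)]))  # {((('P0', 0, 0),), 1): 1}
```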
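Now, given a procedure block b and one of its intermediate blocks sets, its cache
behavior under the least recently used replacement policy can be determined as follows.
Definition 19. Given a procedure block b and an intermediate blocks set s = 〈S, C〉 ∈ IBS_b,
hit(b, s) is defined as
\[ \mathit{hit}(b, s) = \begin{cases} 1 & \text{if } \Big( \sum_{\langle p, s_1, e_1 \rangle \in S} \sum_{l = s_1}^{e_1} \mathit{conflict}(b, \langle p, l \rangle) \Big) + C < A \\ 0 & \text{otherwise} \end{cases} \]
That is, the reference is a hit if fewer than A distinct procedure blocks mapped to the
same cache set as b have been accessed since the previous reference of b.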
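The hit test can be evaluated directly on one profile entry. The sketch below uses
the tuple representation assumed in the earlier examples and shows how moving P1 by
one cache set turns a miss into a hit for a 2-way cache; the starting sets and
associativity are illustrative values.

```python
def hit(b, ibs, p_set, num_sets, assoc):
    """Return 1 if the re-reference of block b is a hit under LRU, else 0.

    b: (procedure, offset); ibs: (intervals, C), intervals = ((p, s1, e1), ...).
    """
    p, l = b
    intervals, own_conflicts = ibs
    other_conflicts = sum(
        1
        for q, s1, e1 in intervals
        for k in range(s1, e1 + 1)
        if (l + p_set[p] - k - p_set[q]) % num_sets == 0
    )
    return int(other_conflicts + own_conflicts < assoc)


entry = ((("P1", 0, 2),), 0)  # IBS of the second reference of <P0, 0>
print(hit(("P0", 0), entry, {"P0": 0, "P1": 0}, num_sets=2, assoc=2))  # 0
print(hit(("P0", 0), entry, {"P0": 0, "P1": 1}, num_sets=2, assoc=2))  # 1
```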
7.4 Procedure Placement Algorithm
In this section, we present our procedure placement algorithm for a specific cache con-
figuration by utilizing IBP. However, not all the procedures are invoked during execu-
tion or frequently called. We only consider the hot procedures for placement.
Hot Procedures. For a procedure p, we define its hot attribute according to its inter-






We sort procedures in decreasing order of p_hot. Let total = ∑_{p∈P} p_hot. We use HotP
to denote the set of hot procedures we will consider for procedure placement. Initially,
HotP = ∅. We keep adding the next hottest of the remaining procedures to HotP until
∑_{p∈HotP} p_hot ≥ total × Thres, where 0 < Thres ≤ 1.
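The selection loop can be sketched as below. The hot attribute values are taken as
given inputs, and the numbers in the example are hypothetical.

```python
def select_hot_procedures(p_hot, thres):
    """Add procedures in decreasing p_hot order until they cover thres of the total."""
    if not 0 < thres <= 1:
        raise ValueError("thres must satisfy 0 < thres <= 1")
    total = sum(p_hot.values())
    hot, covered = [], 0
    for p in sorted(p_hot, key=p_hot.get, reverse=True):
        if covered >= total * thres:
            break
        hot.append(p)
        covered += p_hot[p]
    return hot


# Hypothetical hot attribute values.
print(select_hot_procedures({"a": 50, "b": 30, "c": 15, "d": 5}, 0.9))  # ['a', 'b', 'c']
```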
Note that the hot procedures we define may be different from traditional hot procedures
(i.e., procedures that consume a significant portion of a program’s total execution
time). This is because time-consuming procedures may rarely transfer control
to other procedures or be called by other procedures. On the other hand,
procedure placement is a technique that aims to reduce conflict misses due to procedure
switching. Thus, our procedure placement technique uses the interaction of a procedure
with other procedures, rather than execution time, as the hot attribute.
Our procedure placement technique is presented in Algorithm 5. We start from the
original procedure order (i.e., the procedure sequence in the original code layout). We
place procedures one by one, each at a cache line boundary (lines 3-4). We use
a two-dimensional array hit[b][s] to record the hits of procedure block b for intermediate
blocks set s ∈ IBP_b. Array hit[][] is initialized based on the original procedure order
(lines 5-8). Then, in each iteration of the loop, we walk through all the hot procedures
that have not been placed so far and try out all the displacement values
dis ∈ {0, . . . , N − 1} for them. In each iteration, we select the procedure and its
corresponding displacement value (cache set) that result in the maximum benefit. If there
is no benefit, the iterative process terminates.
Let us assume procedure p is selected for placement. Function getBenefit(p), described
in Algorithm 6, returns the benefit of this placement compared to the code layout
of the previous iteration.

Algorithm 5: Procedure Placement Algorithm
 1  set = 0;
 2  Let List be the list of procedures in the original order;
 3  for i ← 1 to |List| do
 4      p = List[i]; p_set = set; set = set + ⌈p_size/L⌉;
 5  foreach b ∈ B do
 6      foreach 〈s, f(s)〉 ∈ IBP_b do
 7          hit[b][s] = hit(b, s) × f(s);
 8
 9  flag = TRUE; Placed = ∅;
10  while flag do
11      benefit = 0;
12      foreach p ∈ HotP ∧ p ∉ Placed do
13          old_set = p_set;
14          for dis ← 0 to N − 1 do
15              p_set = dis;
16              new_benefit = getBenefit(p);
17              if new_benefit > benefit then
18                  benefit = new_benefit; selected_set = dis; p′ = p;
19
20          p_set = old_set;
21
22      if benefit > 0 then
23          p′_set = selected_set; Placed = Placed ∪ {p′};
24          foreach b ∈ B do
25              foreach 〈s, f(s)〉 ∈ IBP_b do
26                  if p′ ∈ IPS^s_b then
27                      hit[b][s] = hit(b, s) × f(s);
28      else
29          flag = FALSE;

Algorithm 6: Function getBenefit(p)
 1  function getBenefit(p)
 2  benefit = 0;
 3  foreach b ∈ B do
 4      foreach 〈s, f(s)〉 ∈ IBP_b do
 5          if p ∈ IPS^s_b then
 6              benefit = benefit + hit(b, s) × f(s) − hit[b][s];
 7  return benefit;
Definition 20 (Influential Procedure Set (IPS)). Given a procedure block b and one
of its intermediate blocks sets s ∈ IBS_b, let b be 〈p, l〉 and s be 〈S, C〉. The influential
procedure set IPS^s_b is {p} ∪ {p1 : 〈p1, s1, e1〉 ∈ S}.
The influential procedure set (IPS) for the procedure block b and intermediate blocks
set s ∈ IBS_b, IPS^s_b, is simply the set of procedures invoked between the two occurrences
of procedure block b, together with the procedure of b itself.
Property 1. Given a procedure block b and intermediate blocks set s ∈ IBS_b, hit(b, s)
is not affected by the placement of procedure p if p ∉ IPS^s_b.
Property 1 follows directly from Definitions 15 and 19. In function
getBenefit(p) described in Algorithm 6, given a procedure block b and one of its
intermediate blocks sets s, we consider them only if they are affected by the placement of
procedure p (i.e., p ∈ IPS^s_b). If p ∉ IPS^s_b, then hit(b, s) is not affected. In other
words, there is no performance gain or loss. In each iteration, once a procedure and
its corresponding cache set number are selected, the affected entries in hit[][] are
updated (lines 24-27 in Algorithm 5).
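The greedy loop can be illustrated with the following simplified sketch. It is not
the implementation evaluated in this chapter: instead of maintaining the hit[][]
array incrementally, it recomputes the hits of the affected profile entries for each
trial placement, and it assumes the dictionary-based profile representation
{(p, l): {(intervals, C): f}} used purely for illustration.

```python
def count_hits(ibp, p_set, num_sets, assoc, only=None):
    """Sum of hit(b, s) * f(s); if only is given, skip entries whose IPS lacks it."""
    total = 0
    for (p, l), profile in ibp.items():
        for (intervals, own), freq in profile.items():
            if only is not None and only != p and all(q != only for q, _, _ in intervals):
                continue  # Property 1: this entry is unaffected by `only`
            conflicts = own + sum(
                1 for q, s1, e1 in intervals for k in range(s1, e1 + 1)
                if (l + p_set[p] - k - p_set[q]) % num_sets == 0)
            if conflicts < assoc:
                total += freq
    return total


def place_procedures(ibp, p_set, hot, num_sets, assoc):
    """Greedy cache placement; returns a new {procedure: starting set} mapping."""
    p_set, placed = dict(p_set), set()
    while True:
        best_gain, best_p, best_set = 0, None, None
        for p in hot:
            if p in placed:
                continue
            old = p_set[p]
            base = count_hits(ibp, p_set, num_sets, assoc, only=p)
            for dis in range(num_sets):
                p_set[p] = dis
                gain = count_hits(ibp, p_set, num_sets, assoc, only=p) - base
                if gain > best_gain:
                    best_gain, best_p, best_set = gain, p, dis
            p_set[p] = old
        if best_p is None:
            return p_set  # no remaining move has a positive benefit
        p_set[best_p] = best_set
        placed.add(best_p)


# Profile of the Figure 7.2 example, evaluated for a hypothetical 2-set, 2-way cache.
ibp = {("P0", 0): {((("P1", 0, 2),), 0): 1}, ("P1", 0): {((("P0", 0, 0),), 1): 1}}
layout = place_procedures(ibp, {"P0": 0, "P1": 0}, ["P0", "P1"], num_sets=2, assoc=2)
print(layout)  # {'P0': 1, 'P1': 0}
```

In this small example both profiled re-references miss in the original layout and
both hit after P0 is shifted by one cache set.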
7.5 Neutral Procedure Placement
In the last section, we described an algorithm to generate a new code layout for a specific
configuration using IBP. We observe that the code layout generated for a specific cache
configuration is so tied to that configuration that it is not portable across different
cache configurations (see Section 7.6.2). More importantly, code layout portability is
a problem for all the techniques that are aware of cache parameters [40, 45, 50, 39, 17].
In this section, we present an algorithm to generate a “neutral” code layout for a set
of cache configurations. The neutral code layout achieves better average performance
(i.e., average number of cache misses across a set of cache configurations) than any
specific code layout.
We are interested in a set of cache configurations with different cache sizes and
associativities, but with a constant line size. Given cache size (S), line size (L) and
associativity (A), the number of cache sets is S/(L × A). We use C^S_A to denote the cache
configuration with size S and associativity A. Then, the set of cache configurations we
support is Config = {C^S_A | S_min ≤ S ≤ S_max; A_min ≤ A ≤ A_max}, where A_min (A_max) is
the minimum (maximum) associativity and S_min (S_max) is the minimum (maximum) cache
size. We use N_max (N_min) to represent the maximum (minimum) number of cache sets
for the configurations in Config. Obviously,
\[ N_{max} = \frac{S_{max}}{L \times A_{min}}, \qquad N_{min} = \frac{S_{min}}{L \times A_{max}} . \]
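As a small worked example, the number of sets of every configuration in a set can be
tabulated as follows. The sizes, associativities and line size below are borrowed
from the experimental setup of Section 7.6 purely for illustration.

```python
def cache_configs(sizes, assocs, line_size):
    """Map (S, A) to the number of cache sets N = S / (L * A)."""
    return {(S, A): S // (line_size * A) for S in sizes for A in assocs}


configs = cache_configs(sizes=[4096, 8192], assocs=[1, 2, 4, 8], line_size=32)
n_max = max(configs.values())  # 8192 / (32 * 1)
n_min = min(configs.values())  # 4096 / (32 * 8)
print(n_max, n_min)  # 256 16
```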
For a procedure block reference b[j], let b be 〈p, l〉. Its intermediate blocks set
IBS_b[j] is defined as 〈S, C〉 in Section 7.3, where C is the number of procedure blocks
from p itself that conflict with b between the references b[j − 1] and b[j]. However,
the configurations in the set Config may have different numbers of cache sets.
According to Definition 15, the conflict between two procedure blocks may differ for
caches with different numbers of cache sets. Thus, we extend the definition of the
intermediate blocks set IBS_b[j] to 〈S, C[]〉, where S is the same as before, but C[] is an
array and C[i] returns the conflicts for the cache with i cache sets.
Our aim is to improve the average performance for the set of cache configurations
Config. Our algorithm is similar to Algorithm 5 with a few changes. First, for a
procedure p, we use p_set to represent its starting cache set number when mapped to
cache C^{S_max}_{A_min} (i.e., the cache with the maximum number of cache sets, N_max).
Thus, 0 ≤ p_set < N_max. When mapped to cache C^S_A, the starting cache set number of
procedure p is (p_set modulo N′), where N′ = S/(L × A) is the number of cache sets of
cache C^S_A. In other words, for a procedure p, its starting cache set in
C^{S_max}_{A_min} uniquely determines its starting cache set when mapped to other cache
configurations in Config with a smaller number of cache sets. So, once a hot procedure is
selected for placement, we try all the displacements dis ∈ {0, . . . , N_max − 1} for it.
Second, given a procedure block b and one of its intermediate blocks sets s ∈ IBP_b,
hit[b][s] returns the total number of cache hits over all the configurations in the
configuration set Config rather than for a specific configuration. Finally, function
getBenefit(p) defined in Algorithm 6 only returns the benefit for a specific cache
configuration when p is selected for placement. In the following, we define function
getBenefit_Set(p), which returns the benefit for the set of cache configurations Config.
For a procedure block b, let b be 〈p, l〉. It will be mapped to cache set ((pset +
l) modulo N ′) in cache with N ′ cache sets. According to Definition 15, we have
Property 2. Two procedure blocks that conflict in C1 ∈ Config with N1 cache sets,
will conflict in C2 ∈ Config with N2 cache sets, if N1 > N2.
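Property 2 can be justified in one line, under the assumption (stated later in this
section) that the number of cache sets is always a power of 2, so that N_2 divides N_1
whenever N_1 > N_2. For procedure blocks 〈p, l〉 and 〈p′, l′〉 that conflict in a cache
with N_1 sets:

```latex
\[
  (p_{set} + l) - (p'_{set} + l') \;=\; k \times N_1
  \;=\; \Big( k \times \frac{N_1}{N_2} \Big) \times N_2 ,
  \qquad k \in \mathbb{Z},\ \frac{N_1}{N_2} \in \mathbb{Z},
\]
```

so the difference is also an integer multiple of N_2, and by Definition 15 the two
blocks conflict in the cache with N_2 sets as well.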
The new getBenefit_Set(p) function is detailed in Algorithm 7. Let us assume
the selected procedure for placement is p. Then, for a procedure block b and one of
its intermediate blocks sets s, we compute its benefit only if it is affected (p ∈ IPS^s_b).
If it is affected, we compare procedure block b with every procedure block in the
intermediate blocks set s. We determine the maximum number of cache sets at which
two procedure blocks conflict by using the match function: match(a, b) = 2^k, where k is
the number of contiguous matching rightmost (low-order) bits of a and b.
According to Property 2, the conflicts are propagated from caches with more cache
sets to caches with fewer cache sets (lines 16-17). The number of cache sets is always
a power of 2. The conflicts from the procedure blocks belonging to the same procedure
are added too (lines 18-19). In the end, we walk through all the cache configurations and
compute the benefit.

Algorithm 7: Benefits for a set of configurations
 1  function getBenefit_Set(p)
 2  Let conf[] be an array;
 3  benefit = 0;
 4  foreach b ∈ B do
 5      foreach 〈s, f(s)〉 ∈ IBP_b do
 6          if p ∈ IPS^s_b then
 7              Initialize conf[] to 0;
 8              Let s be 〈S′, C′[]〉 and b be 〈p′, l′〉;
 9              foreach procedure block interval 〈p″, s″, e″〉 ∈ S′ do
10                  for l″ ← s″ to e″ do
11                      set = match(p″_set + l″, p′_set + l′);
12                      if set > N_max then
13                          set = N_max;
14                      conf[set] = conf[set] + 1;
15
16              for set ← N_max/2 down to N_min (halving set each step) do
17                  conf[set] = conf[set] + conf[set ∗ 2];
18              for set ← N_max down to N_min (halving set each step) do
19                  conf[set] = conf[set] + C′[set];
20              total_hit = 0;
21              foreach C^S_A ∈ Config do
22                  set = S/(A × L);
23                  if conf[set] < A then
24                      total_hit = total_hit + f(s);
25
26              benefit = benefit + total_hit − hit[b][s];
27  return benefit;
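The match function and the propagation step can be sketched as follows. The sketch
covers only the conflicts contributed by other lines; adding the per-configuration
own-procedure conflicts C[] is omitted. Lines are given as positions in the cache with
N_max sets (i.e., p_set + l), and the example values are hypothetical.

```python
def match(a, b, n_max):
    """2**k for k matching low-order bits of a and b, capped at n_max."""
    diff = a ^ b
    if diff == 0:
        return n_max
    return min(diff & -diff, n_max)  # lowest set bit of the XOR


def conflicts_per_set_count(b_line, other_lines, n_max, n_min):
    """conf[n] = number of other_lines that conflict with b_line in an n-set cache."""
    conf = {}
    n = n_max
    while n >= 1:
        conf[n] = 0
        n //= 2
    for line in other_lines:
        conf[match(b_line, line, n_max)] += 1
    n = n_max // 2
    while n >= n_min:
        conf[n] += conf[n * 2]  # Property 2: conflicts carry over to fewer sets
        n //= 2
    return {n: c for n, c in conf.items() if n >= n_min}


print(match(0b1100, 0b0100, 256))                          # 8
print(conflicts_per_set_count(0, [8, 4, 2, 1, 6], 8, 2))  # {8: 1, 4: 2, 2: 4}
```

One pass over the intermediate blocks thus yields the conflict counts for every
configuration at once, rather than one pass per configuration.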
7.6 Experimental Evaluation
Experimental Setup. We select benchmarks from MiBench and MediaBench for
evaluation purposes as shown in Table 7.1. We conduct our experiments using Sim-
pleScalar framework [14]. Each benchmark is first run with a training input to generate
an execution trace. Then, each benchmark is recompiled with our analysis activated. We
generate the instruction trace of each benchmark using sim-profile, a functional simula-
tor. Given the address trace, our analysis transforms it to the procedure block trace and
builds corresponding IBP. As discussed in section 7.4, we only select the hot procedures
for procedure placement and in our experiment, we set the threshold for hot procedures
as 0.99. Benchmark characteristics such as the number of hot procedures, the size of the
address trace, and the size of the IBP are shown in Table 7.1. We evaluate the
effectiveness of our algorithms with different cache parameters. We vary the cache size
(4KB, 8KB) and associativity (1, 2, 4, 8), and fix the block size at 32 bytes. We collect
the cache misses and execution cycles using the cache simulator and the cycle-accurate
simulator in SimpleScalar. The performance numbers are collected by running the optimized
program with a different input. In other words, we use different inputs for the training
and evaluation runs. We perform all the experiments on a 3GHz Pentium 4 CPU with 2GB
memory. In this work, we focus on cache placement (i.e., assigning a starting set number
to each procedure to minimize cache conflict misses). As for memory placement, Guillon et
al. [45] present a polynomial-time optimal algorithm, and we implement their technique to
minimize code size.

Benchmark    # Procedures   # Hot procedures   Trace (MB)   IBP (MB)
Cjpeg        324            27                 235          5.1
Djpeg        351            29                 63           5.2
Gsmdec       188            16                 101          2.6
Mpeg2dec     216            23                 222          27
Ispell       261            12                 242          6.8
Rsynth       188            21                 495          6.9
Tiff2bw      498            61                 365          4.5
Tiff2rgba    493            68                 359          3.2
Table 7.1: Characteristics of benchmarks.
IBP vs Trace. Both the address trace and IBP sizes are shown in Table 7.1. As shown,
IBP is significantly more compact than the address trace. More importantly,
when the input size increases, the address trace grows while the IBP size most likely
remains the same. This is because when the input size increases, the intermediate
blocks sets most likely stay stable and only their frequencies increase.
Figure 7.3: Cjpeg address trace vs IBP for various inputs with different sizes.
In Figure 7.3, we show how the address trace and IBP sizes change for various inputs
for the Cjpeg benchmark. We tried 10 different inputs with different sizes. The raw image
sizes vary from 34K to 2.9M. In Figure 7.3, the inputs are sorted according to their
size. As shown, the trace size increases significantly when the input size increases.
However, the IBP size stays stable when the input size increases. Different inputs may
cover different paths of the program. Thus, a small input may have more intermediate
blocks sets than a big input. As a result, the size of the intermediate blocks profile for a
small input may be larger than that of a big input. Hence, it is possible that the IBP
size decreases while the input size increases, as shown in Figure 7.3.
7.6.1 Layout for a Specific Cache Configuration
To evaluate our technique, we compare our work with the state of the art [45], which
extends [40] to deal with code size. In [45], a parameter α (0 ≤ α ≤ 1) is used to control
the tradeoff between miss rate reduction and code size expansion. When α = 1, the
algorithm is biased toward miss rate reduction only, which is exactly the same as the
technique in [40]. With α = 1, the cache miss rate is reduced but with significant code
size expansion. Two other values of α, 0.995 and 0.999995, are used in [45]; they give
nearly the same miss rate reduction as α = 1 but with smaller code size expansion.
Though the techniques in [40, 45] are mainly for direct mapped caches, as shown in [40],
they can be applied to set associative caches as well.
Figure 7.4: Cache miss rate improvement and code size expansion compared to original code
layout for 4K direct mapped cache.
We use the original code layout as the baseline. The original code layout is the code
layout from the compiler without any procedure reordering. For each technique, we show its
cache miss rate improvement and code size expansion compared to the original code layout.
We discuss the comparison between our technique (IBP) and the Guillon method [45] for
direct mapped and set associative caches, respectively. For the Guillon method [45], we
tried various values of α (1, 0.999995, 0.995).
Direct Mapped Cache. The results are shown in Figures 7.4 and 7.5 for 4K and 8K
caches, respectively. Both techniques are effective for direct mapped caches, as shown.
As for the Guillon method, our finding is that when α is set to 0.999995 (0.995), it
achieves similar miss rate improvement as α = 1, but the code size expansion is much
smaller than with α = 1, as shown in [45]. However, the Guillon method may generate a
worse code layout than the original code layout due to its imprecise conflict model
(e.g., djpeg for the 4K cache and tiff2bw for the 8K cache).
Figure 7.5: Cache miss rate improvement and code size expansion compared to original code
layout for 8K direct mapped cache.
For every benchmark and configuration pair, IBP achieves more cache miss rate
improvement than the Guillon method. For the 4K cache, IBP improves cache miss rate by
43.5% on average; the Guillon method improves it by 30% to 30.5% on
average depending on the value of α. For the 8K cache, IBP improves cache miss rate by
64% on average; the Guillon method improves it by 37.3% to 38.3% on
average depending on the value of α. For the Guillon method, if the miss rate improvement
is negative, we consider it as 0 when computing average values.
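The averaging convention for negative improvements amounts to clamping each value at
zero before taking the mean. A tiny sketch with hypothetical improvement values:

```python
def clamped_mean(improvements):
    """Average improvement, counting negative improvements as 0."""
    return sum(max(0.0, x) for x in improvements) / len(improvements)


# Hypothetical per-benchmark improvements in percent; -10 is treated as 0.
print(round(clamped_mean([40.0, -10.0, 30.0]), 2))  # 23.33
```

Note that this convention can only raise the reported average of a technique that
sometimes degrades the layout.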
There are two reasons for the gain of IBP over the Guillon method. First, IBP is a
precise model working at the granularity of procedure blocks, whereas the Guillon method is
based on imprecise conflict information. Second, in addition to cache modeling, the two
techniques differ in the nature of the procedure placement algorithm. In IBP, we start
from the original procedure order and align the procedures at cache line boundaries. In
each round, we try all the hot procedures not placed so far to determine the best
procedure for placement and its corresponding placement. In the Guillon method, however,
the sequence in which procedures are placed is pre-determined by the conflict graph, and
only the different placements of procedures are attempted.
Figure 7.6: Cache miss rate improvement compared to original code layout for set associative
cache. (a) Associativity = 2. (b) Associativity = 4. (c) Associativity = 8.
The IBP-based method achieves more miss rate improvement with a very small code
size expansion. For the 4K cache, IBP expands code size by 1.6% on average; the Guillon
method expands code size by 0.8% to 3% on average depending on the value of α.
For the 8K cache, IBP expands code size by 2.4% on average; the Guillon method expands
code size by 0.9% to 16% on average depending on the value of α.
Set Associative Caches. As shown in Figure 7.6(a), (b) and (c), the Guillon method is
not always effective for set associative caches, especially when the associativity is high
(4, 8). For some benchmarks, the code layout from the Guillon method is much worse than
the original code layout. For the 2-way associative cache, 12 out of 48 code layouts are
worse than the original code layout; for the 4-way associative cache, 17 out of 48 code
layouts are worse than the original code layout; for the 8-way associative cache, 16 out
of 48 code layouts are worse than the original code layout. On the other hand, IBP
is always better than the original code layout. This is because IBP allows us to precisely
model the cache performance for set associative caches, while the Guillon method does not
model set associative caches accurately.
For the 2-way associative cache, IBP improves cache miss rate by 43.7% on average;
the Guillon method improves cache miss rate by 17.3% to 20.7% on average depending
on the value of α. For the 4-way associative cache, IBP improves cache miss rate by 33.4%
on average; the Guillon method improves cache miss rate by 8.6% to 12.9% on
average depending on the value of α. For the 8-way associative cache, IBP improves cache
miss rate by 24.8% on average; the Guillon method improves cache miss rate by 6.1% to
14.2% on average depending on the value of α. For the Guillon method, if the miss rate
improvement is negative, we consider it as 0 when computing average values.
Finally, for both techniques, the cache miss rate improvement decreases when the
associativity increases. This is because higher associativity leads to fewer cache sets,
leaving little opportunity for procedure reordering.
Figure 7.7: Execution time improvement compared to original code layout.
IBP achieves high miss rate improvement with only small code expansion. As for
code size, for 2-way associative cache, IBP expands code size by 1.3% on an average;
for 4-way associative cache, IBP expands code size by 1.1% on an average; for 8-way
associative cache, IBP expands code size by 0.9% on an average.
Execution Time Improvement. Figure 7.7 shows the execution time improvement
of the new code layout (IBP) over the original code layout. These experiments assume
a single-issue in-order processor with a cache miss latency of 100 cycles and a cache
hit latency of 1 cycle. Some benchmarks do not gain considerable execution time
improvement even though their cache miss rate is improved. This is because, for these
benchmarks, the absolute number of cache misses without procedure placement is very
small. Thus, an improvement in cache miss rate does not contribute much to the overall
execution time reduction. On average, IBP obtains 19.67% execution time improvement
for the 1-way cache, 13.53% for the 2-way cache, 9.32% for the 4-way cache and 5.3%
for the 8-way cache.
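Why a large miss rate improvement may translate into a small execution time improvement can be seen from a simple cycle count. The sketch below assumes that every instruction fetch costs the hit latency and that every miss adds the miss latency on top; this additive model and all input numbers are illustrative assumptions, not the simulator used in the experiments.

```python
HIT_LATENCY = 1      # cycles per instruction fetch that hits
MISS_LATENCY = 100   # additional cycles per instruction cache miss (assumed additive)


def cycles(fetches, misses):
    """Total cycles of a single-issue in-order core under the assumed model."""
    return fetches * HIT_LATENCY + misses * MISS_LATENCY


def time_improvement(fetches, misses_before, misses_after):
    """Relative execution time improvement when misses drop."""
    before = cycles(fetches, misses_before)
    after = cycles(fetches, misses_after)
    return (before - after) / before


# Both hypothetical benchmarks halve their misses (50% miss improvement).
few_misses = time_improvement(10_000_000, 1_000, 500)
many_misses = time_improvement(10_000_000, 200_000, 100_000)
print(f"few misses: {few_misses:.2%}, many misses: {many_misses:.2%}")
```

The benchmark with few absolute misses gains well under 1% in execution time, while the miss-dominated benchmark gains about a third, although both see the same relative miss reduction.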
Energy Consumption Improvement. Figure 7.8 shows the memory hierarchy energy
improvement of the new code layout (IBP) over the original code layout. For the
different cache configurations, we model the energy consumption of the memory
hierarchy using the CACTI [89] model for 0.13µm technology. In this work, our focus is
dynamic energy consumption. The energy consumption of one access to memory is assumed
to be 200 times that of one access to a standard level-one cache [106]. On average,
IBP obtains 31.2% energy consumption improvement for the 1-way cache, 22.4% for the
2-way cache, 15.8% for the 4-way cache and 10% for the 8-way cache.
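A rough version of this dynamic energy accounting is sketched below. It normalizes the energy of one level-one cache access to 1 and assumes that each cache miss triggers exactly one memory access; both choices, as well as the function names and input numbers, are assumptions for illustration. In the experiments the per-access cache energy comes from CACTI rather than from a constant.

```python
MEMORY_TO_L1_ENERGY_RATIO = 200  # one memory access costs 200x one L1 access


def dynamic_energy(cache_accesses, cache_misses, l1_access_energy=1.0):
    """Dynamic energy of the memory hierarchy in units of one L1 access."""
    memory_access_energy = MEMORY_TO_L1_ENERGY_RATIO * l1_access_energy
    return cache_accesses * l1_access_energy + cache_misses * memory_access_energy


def energy_improvement(cache_accesses, misses_before, misses_after):
    """Relative energy improvement when the number of misses drops."""
    before = dynamic_energy(cache_accesses, misses_before)
    after = dynamic_energy(cache_accesses, misses_after)
    return (before - after) / before


# Hypothetical run: 1M cache accesses, misses reduced from 10,000 to 5,000.
saving = energy_improvement(1_000_000, 10_000, 5_000)
print(f"energy improvement: {saving:.2%}")
```

Because a memory access is so much more expensive than a cache access, even a 1% miss rate makes memory accesses dominate the energy budget in this example.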
Impact of Replacement Policy. IBP-based cache modeling is developed under the
assumption that the replacement policy is least recently used (LRU). Berg and
Hagersten observed that, for most applications, different replacement policies have
little effect on the miss ratio, although small differences exist [20]. We therefore
evaluate the IBP layout under the FIFO replacement policy: the code layout is generated
assuming LRU replacement, but the miss rate is computed using FIFO replacement.
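The difference between the two policies can be reproduced with a small trace-driven simulation. The following Python sketch is not the simulator used in the experiments; `count_misses` and the example trace are hypothetical, and the trace consists of memory block numbers rather than real instruction addresses.

```python
from collections import OrderedDict


def count_misses(trace, num_sets, ways, policy="LRU"):
    """Count misses of a set-associative cache for a trace of memory block numbers."""
    if policy not in ("LRU", "FIFO"):
        raise ValueError("policy must be 'LRU' or 'FIFO'")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        lines = sets[block % num_sets]
        if block in lines:
            if policy == "LRU":
                lines.move_to_end(block)  # a hit refreshes recency under LRU
            continue                      # under FIFO a hit changes nothing
        misses += 1
        if len(lines) == ways:
            lines.popitem(last=False)     # evict the oldest entry of the set
        lines[block] = True
    return misses


# One set with two ways: block 0 is reused right before block 2 arrives.
trace = [0, 1, 0, 2, 0]
lru_misses = count_misses(trace, num_sets=1, ways=2, policy="LRU")
fifo_misses = count_misses(trace, num_sets=1, ways=2, policy="FIFO")
print(lru_misses, fifo_misses)
```

On this trace FIFO evicts block 0 even though it was just reused, which produces one more miss than LRU. The same mechanism explains why a layout tuned under the LRU assumption can behave slightly differently under FIFO.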
In Figure 7.9, we show the cache miss rate improvement over the original code
layout under the FIFO replacement policy. We observe that, for most applications, the
IBP code layout is still quite effective. On average, IBP obtains 43.3% miss rate
improvement for the 2-way cache, 32.2% for the 4-way cache, and 23.7% for the 8-way
cache. However, we notice that small differences do exist. For the 4K, 8-way set
associative cache, the IBP code layout is slightly worse than the original code layout
for Mpeg2dec under the FIFO replacement policy. This is probably because the IBP code
layout is not very effective for this configuration for Mpeg2dec, as shown in
Figure 7.6.
Figure 7.8: Energy reduction compared to original code layout.
Figure 7.9: Cache miss rate improvement of IBP over original code layout for FIFO
replacement policy.
Runtime. Our procedure placement algorithm is very efficient thanks to the compact
format of IBP. It takes only a few minutes to complete our analysis for any of the
considered settings.
7.6.2 Neutral Layout
We evaluate our neutral code layout using the set of configurations Config = {CSA | 4K ≤
S ≤ 8K; 1 ≤ A ≤ 8}. For each configuration CSA, we generate a specific S-A-way
code layout. So there are in total 8 specific code layouts and 1 neutral code layout.
We first present the code layout portability problem in Table 7.2. For each code
layout, we evaluate its portability using all 8 configurations in Config. Table 7.2
shows the results for 2 benchmarks, Ispell and Rsynth.
Portability. As shown in Table 7.2, for each configuration, the number of cache misses
varies significantly across different code layouts. For example, for the 4K, 1-way
cache, Rsynth incurs about 8.7 million cache misses using the code layout generated for
the 4K-1-way configuration, while the number of cache misses goes up to more than 62
million using the code layout generated for the 4K-2-way cache configuration. As
expected, we also observe that, in most cases, the code layout generated for a cache
configuration C is the best code layout for C among all the code layouts in the set
(the diagonal from top left to bottom right). This is because the underlying cache
parameters match exactly the cache parameters assumed during procedure placement.
However, the code layout generated for C may perform badly for other configurations,
even though it is good for configuration C. For example, for the benchmark Ispell, the
4K-1-way code layout is better than the 4K-2-way code layout for the 4K-1-way cache
configuration, but worse than the 4K-2-way code layout for the 4K-2-way cache
configuration. In other words, there is no single code layout that performs better than
the other code layouts for all the configurations. More importantly, the above
portability problem exists for all procedure placement techniques [40, 45, 50, 39, 17]
that take cache parameters into account. This has been observed for [40, 45] as well.
Ispell (rows: code layout; columns: cache configuration)
Layout        4K-1-way    4K-2-way    4K-4-way    4K-8-way    8K-1-way    8K-2-way    8K-4-way    8K-8-way
4K-1-way     1,102,555   1,452,167   1,786,182   1,879,787     863,642     498,419     325,743       9,758
4K-2-way     1,508,960   1,287,602   1,428,123   1,752,959     744,746     649,710     729,356     743,937
4K-4-way     1,559,625   1,666,711   1,327,332   1,779,965     858,854     800,705     717,038     791,925
4K-8-way     1,522,588   1,633,681   1,657,776   1,635,333     928,764     961,416     362,726     314,796
8K-1-way     1,487,002   1,558,181   1,811,224   1,946,053     163,051     115,887      68,588      12,269
8K-2-way     1,373,225   1,692,295   1,791,685   1,985,196     395,723      71,407     100,958       7,153
8K-4-way     1,598,770   1,543,161   1,750,547   1,946,337   1,039,925     518,231       9,577       9,018
8K-8-way     1,529,317   1,665,145   1,725,674   1,899,575   1,152,934     653,463     173,629       2,293

Rsynth (rows: code layout; columns: cache configuration)
Layout        4K-1-way    4K-2-way    4K-4-way    4K-8-way    8K-1-way    8K-2-way    8K-4-way    8K-8-way
4K-1-way     8,672,681   9,648,096  10,411,082  10,905,595   6,436,063   4,640,705   5,102,060   5,337,687
4K-2-way    62,448,135   9,172,659  10,514,702  11,333,635  37,662,987   6,661,833   5,175,124   5,553,046
4K-4-way    43,399,761  46,574,878   9,158,260   9,508,933  12,728,470  11,786,572   5,653,478   5,564,737
4K-8-way    27,808,251  24,049,320  17,140,999   8,922,127   7,768,132   6,808,949   5,327,826   5,674,078
8K-1-way    25,684,168  16,974,226  15,573,160  13,028,555   3,467,594   3,960,793   4,680,007   5,487,350
8K-2-way    20,305,855  13,342,605  12,224,826  12,393,127  12,837,206   3,702,064   4,440,956   5,434,249
8K-4-way    25,773,676  21,407,231  12,029,527  10,012,284  12,094,403   8,132,121   4,094,100   5,307,878
8K-8-way    76,932,309  24,046,861  13,506,907  12,147,314  28,077,477  15,401,963   4,743,588   4,570,997
Table 7.2: Cache misses of different code layouts running on different cache configurations.
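The diagonal observation can be checked mechanically. The Python sketch below uses the Ispell entries of Table 7.2 for the four 4K configurations only; the helper name `best_layout_per_config` is hypothetical. For every cache configuration it reports which layout incurs the fewest misses.

```python
CONFIGS = ["4K-1-way", "4K-2-way", "4K-4-way", "4K-8-way"]

# Ispell cache misses from Table 7.2 (4K configurations only).
# Row i: layout generated for CONFIGS[i]; column j: run on CONFIGS[j].
ISPELL_MISSES = [
    [1_102_555, 1_452_167, 1_786_182, 1_879_787],
    [1_508_960, 1_287_602, 1_428_123, 1_752_959],
    [1_559_625, 1_666_711, 1_327_332, 1_779_965],
    [1_522_588, 1_633_681, 1_657_776, 1_635_333],
]


def best_layout_per_config(misses, names):
    """Map each cache configuration to the layout with the fewest misses on it."""
    best = {}
    for j, config in enumerate(names):
        column = [row[j] for row in misses]
        best[config] = names[column.index(min(column))]
    return best


best = best_layout_per_config(ISPELL_MISSES, CONFIGS)
for config, layout in best.items():
    print(f"{config}: best layout is the one generated for {layout}")
```

For this subset, every configuration is served best by the layout generated for it, while each of these layouts loses on the other three configurations.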
Performance. The comparison of the 9 (8 specific + 1 neutral) code layouts in terms of
average performance is shown in Figure 7.10. The Y-axis in Figure 7.10 shows the
average cache miss improvement over all cache configurations compared to the original
code layout.
First, the neutral code layout always performs better than any specific code layout
in terms of average performance, and it achieves a positive performance improvement for
all the benchmarks. Though the neutral code layout is better on average than any
specific code layout, it does not win for every configuration. In most cases, the
S-A-way code layout is the best for the CSA configuration. So, our neutral code layout
is not the best code layout for a specific cache configuration, but the best code
layout for the average performance across a set of configurations.
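The distinction between winning on average and winning for a single configuration can be made concrete with a toy example. All miss counts below are invented for illustration and do not come from the experiments; the layout names are hypothetical as well.

```python
# Hypothetical miss counts for two cache configurations.
ORIGINAL = [100, 100]
LAYOUTS = {
    "specific_A": [50, 120],   # tuned for configuration 0
    "specific_B": [120, 50],   # tuned for configuration 1
    "neutral":    [60, 60],    # tuned for the whole set
}


def average_improvement(misses, original):
    """Mean relative miss improvement over the original layout across configurations."""
    gains = [(o - m) / o for m, o in zip(misses, original)]
    return sum(gains) / len(gains)


averages = {name: average_improvement(m, ORIGINAL) for name, m in LAYOUTS.items()}
best_on_average = max(averages, key=averages.get)
winners = [min(LAYOUTS, key=lambda name: LAYOUTS[name][j]) for j in range(len(ORIGINAL))]
print(best_on_average, winners)
```

In this example the neutral layout has the best average improvement although a specific layout wins on each individual configuration, which mirrors the trade-off described above.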
Second, we notice that the best specific code layout differs across benchmarks. For
example, for Djpeg, the best specific code layout is 8K-1-way; for Mpeg2dec, the best
specific code layout is 4K-1-way. Moreover, some specific code layouts for highly
associative caches (4-way and 8-way) degrade the average performance compared to the
original code layout (e.g., for Mpeg2dec and Rsynth). Though the specific code layouts
are better than the original code layout for their own configurations, they are worse
than the original code layout for the rest of the configurations. As a result, they are
worse than the original code layout on average.
Code Size. The neutral code layout achieves a better average miss rate improvement with
a small code size expansion. On average, the neutral code layout expands the code size
by 2.3%.
























Figure 7.10: Average cache miss rate improvement comparison.
7.7 Summary
In this chapter, we first introduce the intermediate blocks profile (IBP) to precisely
model the cost and benefit of procedure placement. Then, we propose a procedure
placement algorithm using IBP. Furthermore, we notice that the code layout generated
for a specific cache configuration is not portable across different cache
configurations. So, we propose an algorithm that can generate a neutral code layout for
a set of cache configurations. Experiments indicate that our technique improves both
cache performance and energy consumption compared to the state of the art.
Chapter 8
Putting it All Together
8.1 Integrated Optimization Flow
Clearly, the techniques we propose in this thesis (design space exploration of caches,
procedure placement and cache locking) are complementary approaches. More importantly,
cache performance can benefit significantly from a combination of these approaches. In
this chapter, we propose an integrated exploration and optimization strategy for
improving instruction cache performance by combining all the techniques.
The integrated instruction cache optimization strategy is described in Figure 8.1. Our
exploration and optimization involve three steps.
• Step One. Perform the static program analysis presented in chapters 4 and 5 to
explore the instruction cache design space. The output is one good cache
configuration C in terms of cache performance.





Figure 8.1: Integrated instruction cache optimization flow.
• Step Two. Perform the procedure placement optimization presented in chapter 7
to reorder the procedures such that the conflict misses are reduced. The input is
the cache configuration C returned from Step One. The output is an improved
code layout L which gives fewer instruction cache misses than the original
code layout.
• Step Three. Perform the instruction cache locking optimization presented in
chapter 6 to further improve instruction cache performance. The input is the
code layout L generated by Step Two. The output is the cache locking solution.
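The three steps above can be sketched as a simple pipeline in which the output of one step is the input of the next. The Python code below only illustrates this data flow: `integrated_flow` and the three stub functions are hypothetical placeholders, not the analyses of chapters 4 to 7.

```python
def integrated_flow(design_space, explore, place, lock):
    """Chain the three steps; explore, place and lock are caller-supplied callables."""
    config = explore(design_space)  # Step One: pick a good cache configuration C
    layout = place(config)          # Step Two: code layout L tuned for C
    locking = lock(layout)          # Step Three: locking solution for layout L
    return config, layout, locking


# Hypothetical stand-ins so that the sketch runs; real analyses would replace them.
def explore_stub(design_space):
    return design_space[0]


def place_stub(config):
    return {"config": config, "order": ["main", "f", "g"]}


def lock_stub(layout):
    return {"locked": layout["order"][:1]}


result = integrated_flow(["8K-2-way"], explore_stub, place_stub, lock_stub)
print(result)
```

Passing the steps as callables keeps the ordering constraint explicit: locking is decided only after the layout is fixed, and the layout is generated only after the configuration is chosen.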
8.2 Experimental Evaluation
In this section, we demonstrate the effectiveness of our proposed integrated
instruction cache optimization flow. We use the benchmarks Gsm and Mpeg2 as case
studies. We vary the cache associativity (1, 2, 4, 8) and the block size (16, 32 bytes)
but keep the cache size constant (8K). Thus, there are 8 cache configurations in total
in the design space.
Figure 8.2: Cache miss rate improvement of integrated instruction cache optimizations.
Baseline cache configuration is a direct mapped cache. Step 1: Design Space Exploration
(DSE); Step 2: Procedure Placement (Layout); Step 3: Instruction Cache Locking (Locking).
Figure 8.2 shows the cache miss rate improvement of the proposed integrated
instruction cache optimizations. We consider the 8K, direct mapped cache with a line
size of 32 bytes as the baseline cache configuration, and the instruction cache miss
rate improvement is measured relative to this baseline. As evident from the figure,
cache miss improvement is achieved in each step. In the end, cache performance benefits
significantly from the proposed integrated instruction cache optimizations.
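The design space and baseline described above can be enumerated in a few lines of Python. The dictionary keys are hypothetical names chosen for this sketch; the parameter values are the ones stated in the text.

```python
from itertools import product

CACHE_SIZE = 8 * 1024            # bytes, kept constant
ASSOCIATIVITIES = (1, 2, 4, 8)
BLOCK_SIZES = (16, 32)           # bytes

design_space = [
    {"size": CACHE_SIZE, "ways": ways, "block": block,
     "sets": CACHE_SIZE // (ways * block)}
    for ways, block in product(ASSOCIATIVITIES, BLOCK_SIZES)
]

# Baseline: 8K, direct mapped, 32-byte line size.
baseline = next(c for c in design_space if c["ways"] == 1 and c["block"] == 32)

print(len(design_space), baseline)
```

Four associativities times two block sizes give the 8 configurations, ranging from 512 sets (direct mapped, 16-byte blocks) down to 32 sets (8-way, 32-byte blocks).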
In summary, cache design space exploration, procedure placement and cache locking
are complementary approaches. Successive cache miss improvement is achieved
through the three steps. Our proposed integrated instruction cache optimizations can




The application specific nature of embedded systems creates the opportunity to design
a customized system-on-chip (SoC) platform for a particular application or an applica-
tion domain. With the knowledge of application characteristics, many cache parameters
can be customized to meet various design goals. This is especially true for parameteriz-
able embedded systems. Furthermore, the program code can be transformed in various
ways to fit the underlying cache architectures. The optimized memory architecture and
program code can improve the performance and energy consumption significantly.
The objective of this thesis is to utilize application characteristics so as to achieve
significant cache performance improvements. Application characteristics used in this
thesis include the basic block execution count profile (branch probability, loop
bound), the temporal reuse profile and the intermediate blocks profile. These
application characteristics
CHAPTER 9. CONCLUSION 145
are identified through profiling and exploited by our subsequent analytical approach.
In this thesis, we consider both hardware (architecture) and software optimization so-
lutions. For hardware (architecture) solutions, we propose techniques to customize the
instruction cache according to the specific temporal and spatial localities of a given
application. For software solutions, we propose techniques to tailor the program to
underlying instruction cache parameters.
More concretely, the contributions of this thesis are:
• a static program analysis that accurately and efficiently models the cache behavior
of a specific cache configuration.
• an analytical approach that accurately explores a cache design space with multiple
cache configurations in a single pass.
• a precise cache model based on the temporal reuse profile and two static
instruction cache locking algorithms for performance improvement.
• an improved procedure placement algorithm for set associative caches using the
intermediate blocks profile and an algorithm for a neutral code layout with good
portability.
9.2 Future Directions
Though the techniques developed in this thesis are mainly for instruction caches, most
of them can be applied to data caches. First, the probabilistic cache state proposed in
chapters 4 and 5 is a general concept that can be used for data caches as well [81],
with special optimizations for space. Second, the cache locking techniques in chapter 6
can be used for data caches, given the corresponding program data reference trace.
Finally, the procedure placement technique in chapter 7 can be used for data caches by
replacing procedures with data segments.
With the advent of multi-core architectures, the embedded computing world is moving in
the direction of multiprocessing. This opens up new challenges for embedded system
designers. The main challenges arise from the mapping and scheduling of parallel tasks,
the conflict modeling of shared resources such as caches and communication media, and
the timing unpredictability caused by cache warm-up due to task migration and
preemption. The techniques developed in this thesis mainly target a single core. When
attempting to extend these local optimization techniques to global optimization, the
interactions among the cores have to be taken into account.
Though multi-core architecture offers better performance and energy saving op-
portunities, the application characteristics are not utilized and incorporated well in the
current hardware and software design flow. Thus, there is a huge gap between the
potential performance that can be offered by multi-core architecture and the way that
they are being used. In this thesis, we consider the software transformations for single
core instruction cache. For multi-core architecture, our goal is to develop compilation
techniques that can optimize high level programs for application-specific multi-core ar-
chitecture by utilizing the knowledge of application and system resources (processor
cores, memory units, communication bandwidth).
Bibliography
[1] 3rd Generation Intel Xscale Microarchitecture Developer’s Manual. Intel, May 2007.
http://www.intel.com/design/intelxscale.
[2] ADSP-BF533 Processor Hardware Reference. Analog Devices, April 2009.
http://www.analog.com/static/imported-files/processor_
manuals/bf533_hwr_Rev3.4.pdf.
[3] Arc International. In http://www.arccores.com, 2005.
[4] ARM Cortex A-8 Technical reference Manual. Arm, Revised March 2004. http://
www.arm.com/products/CPUs/families/ARMCortexFamily.html.
[5] ARM Embedded Processor. In http://www.arm.com, 2005.
[6] ARM1156T2-S Technical reference Manual. Arm, Revised July 2007. http://www.
arm.com/products/CPUs/families/ARM11Family.html.
[7] IBM Systems and Technology Group. PowerPC: IBM Microelectron-
ics. http://www-306.ibm.com/chips/techlib/techlib.nsf/
productfamilies/PowerPC.
[8] NIOS Embedded Processor. In http://www.altera.com, 2005.
[9] Xtensa Processor Generator. In http://www.tensilica.com, 2005.
[10] D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In MICRO
32: Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchi-
tecture, 1999.
[11] M. Alt et al. Cache behavior prediction by abstract interpretation. In SAS ’96: Proceed-
ings of the Third International Symposium on Static Analysis, 1996.
[12] K. Anand and R. Barua. Instruction cache locking inside a binary rewriter. In CASES
’09: Proceedings of the 2009 international conference on Compilers, architecture, and
synthesis for embedded systems, 2009.
[13] R. Arnold et al. Bounding worst-case instruction cache performance. In RTSS ’94:
Proceedings of the 15th IEEE Real-Time Systems Symposium, 1994.
[14] T. Austin, E. Larson, and D. Ernst. Simplescalar: An infrastructure for computer system
modeling. IEEE Computer, 35(2), 2002.
[15] R. Balasubramonian et al. Memory hierarchy reconfiguration for energy and performance
in general-purpose processor architectures. In MICRO 33: Proceedings of the 33rd an-
nual ACM/IEEE international symposium on Microarchitecture, 2000.
[16] T. Ball. Efficiently counting program events with support for on-line queries. ACM
Transactions on Programming Languages and Systems, 16(5), 1994.
[17] S. Bartolini and C. A. Prete. Optimizing instruction cache performance of embedded
systems. ACM Trans. Embed. Comput. Syst., 4(4):934–965, 2005.
[18] L. Benini et al. From architecture to layout: Partitioned memory synthesis for
embedded systems-on-chip. In DAC ’01: Proceedings of the 38th annual Design Automation
Conference, 2001.
[19] L. Benini, A. Macii, and M. Poncino. Energy-aware design of embedded memories: A
survey of technologies, architectures, and optimization techniques. ACM Trans. Embed.
Comput. Syst., 2(1):5–32, 2003.
[20] E. Berg and E. Hagersten. Statcache: a probabilistic approach to efficient and accurate
data locality analysis. In ISPASS ’04: Proceedings of the 2004 IEEE International
Symposium on Performance Analysis of Systems and Software, 2004.
[21] K. Beyls and E. H. D’Hollander. Reuse distance as a metric for cache behavior. In
Proceedings of the IASTED International Conference on Parallel and Distributed Computing
and Systems, 2001.
[22] K. Beyls and E. H. D’Hollander. Reuse distance-based cache hint selection. In Euro-Par
’02: Proceedings of the 8th International Euro-Par Conference on Parallel Processing,
2002.
[23] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level
power analysis and optimizations. In ISCA ’00: Proceedings of the 27th annual interna-
tional symposium on Computer architecture, 2000.
[24] B. Buck and J. K. Hollingsworth. An api for runtime code patching. Int. J. High Perform.
Comput. Appl., 14(4), 2000.
[25] A. M. Campoy et al. Cache contents selection for statically-locked instruction caches: An
algorithm comparison. In ECRTS ’05: Proceedings of the 17th Euromicro Conference
on Real-Time Systems, 2005.
[26] C. Cascaval and D. A. Padua. Estimating cache misses and locality using stack distances.
In ICS ’03: Proceedings of the 17th annual international conference on Supercomputing,
2003.
[27] S. Chatterjee et al. Exact analysis of the cache behavior of nested loops. SIGPLAN Not.,
36(5):286–297, 2001.
[28] C. Ding and Y. Zhong. Predicting whole-program locality through reuse distance analy-
sis. SIGPLAN Not., 38(5):245–257, 2003.
[29] J. Edler and M. D. Hill. Dinero iv trace-driven uniprocessor cache simulator.
http://www.cs.wisc.edu/~markhill/DineroIV/.
[30] H. Falk, S. Plazar, and H. Theiling. Compile-time decided instruction cache locking using
worst-case execution paths. In CODES+ISSS ’07: Proceedings of the 5th IEEE/ACM
international conference on Hardware/software codesign and system synthesis, 2007.
[31] C. Ferdinand and R. Wilhelm. On predicting data cache behaviour for real-time systems.
In ACM SIGPLAN Workshop 1998 on Languages, Compilers, and Tools for Embedded
System, 1998.
[32] C. Ferdinand and R. Wilhelm. Efficient and precise cache behavior prediction for
real-time systems. Real-Time Systems, 17(2/3):131–181, 1999.
[33] G. Gebhard and S. Altmeyer. Optimal task placement to improve cache performance. In
EMSOFT ’07: Proceedings of the 7th ACM & IEEE international conference on Embed-
ded software, 2007.
[34] A. Ghosh and T. Givargis. Analytical design space exploration of caches for embedded
systems. In DATE ’03: Proceedings of the conference on Design, Automation and Test
in Europe, 2003.
[35] A. Ghosh and T. Givargis. Cache optimization for embedded processor cores: An ana-
lytical approach. ACM Trans. Des. Autom. Electron. Syst., 9(4):419–440, 2004.
[36] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: a compiler framework for
analyzing and tuning memory behavior. ACM Trans. Program. Lang. Syst., 21(4):703–
746, 1999.
[37] T. Givargis, F. Vahid, and J. Henkel. Fast cache and bus power estimation for parameter-
ized system-on-a-chip design. In DATE ’00: Proceedings of the conference on Design,
automation and test in Europe, 2000.
[38] T. Givargis, F. Vahid, and J. Henkel. System-level exploration for pareto-optimal con-
figurations in parameterized systems-on-a-chip. In ICCAD ’01: Proceedings of the 2001
IEEE/ACM international conference on Computer-aided design, 2001.
[39] N. Gloy et al. Procedure placement using temporal ordering information. In MICRO 30:
Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitec-
ture, 1997.
[40] N. Gloy and M. D. Smith. Procedure placement using temporal-ordering information.
ACM Trans. Program. Lang. Syst., 21(5):977–1027, 1999.
[41] C. Goldfeder. Frequency-based code placement for embedded multiprocessors. In DAC
’05: Proceedings of the 42nd annual Design Automation Conference, 2005.
[42] A. Gordon-Ross et al. A one-shot configurable-cache tuner for improved energy and
performance. In DATE ’07: Proceedings of the conference on Design, automation and
test in Europe, 2007.
[43] A. Gordon-Ross, F. Vahid, and N. Dutt. Automatic tuning of two-level caches to embed-
ded applications. In DATE ’04: Proceedings of the conference on Design, automation
and test in Europe, 2004.
[44] A. Gordon-Ross, F. Vahid, and N. Dutt. Fast configurable-cache tuning with a unified
second-level cache. In ISLPED ’05: Proceedings of the 2005 international symposium
on Low power design, 2005.
[45] C. Guillon et al. Procedure placement using temporal-ordering information: dealing with
code size expansion. In CASES ’04: Proceedings of the 2004 international conference
on Compilers, architecture, and synthesis for embedded systems, 2004.
[46] M. R. Guthaus et al. Mibench: A free, commercially representative embedded benchmark
suite. In Workshop on Workload Characterization, 2001.
[47] M. S. Haque, A. Janapsatya, and S. Parameswaran. Susesim: a fast simulation strategy
to find optimal l1 cache configuration for embedded systems. In CODES+ISSS ’09: Pro-
ceedings of the 7th IEEE/ACM international conference on Hardware/software codesign
and system synthesis, 2009.
[48] D. Hardy and I. Puaut. WCET analysis of multi-level non-inclusive set-associative in-
struction caches. In RTSS ’08: Proceedings of the 2008 Real-Time Systems Symposium,
2008.
[49] J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative
cache behavior. IEEE Trans. Comput., 48(10):1009–1024, 1999.
[50] A. H. Hashemi, D. R. Kaeli, and B. Calder. Efficient procedure mapping using cache line
coloring. SIGPLAN Not., 32(5):171–182, 1997.
[51] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach.
Morgan Kaufmann Publishers Inc., 2002.
[52] M. D. Hill and A. J. Smith. Evaluating associativity in cpu caches. IEEE Transactions
on Computers, 38(12), 1989.
[53] W. W. Hwu and P. P. Chang. Achieving high instruction cache performance with an
optimizing compiler. SIGARCH Comput. Archit. News, 17(3):242–251, 1989.
[54] A. Janapsatya et al. Instruction trace compression for rapid instruction cache simulation.
In DATE ’07: Proceedings of the conference on Design, Automation and Test in Europe,
2007.
[55] J. Kalamatianos et al. Analysis of temporal-based program behavior for improved in-
struction cache performance. IEEE Trans. Comput., 48(2):168–175, 1999.
[56] J. Kin et al. Power efficient mediaprocessors: design space exploration. In DAC ’99:
Proceedings of the 36th annual ACM/IEEE Design Automation Conference, 1999.
[57] X. F. Li et al. Design space exploration of caches using compressed traces. In ICS
’04: Proceedings of the 18th annual international conference on Supercomputing, 2004.
[58] Y. Li et al. Hardware-software co-design of embedded reconfigurable architectures. In
DAC ’00: Proceedings of the 37th annual Design Automation Conference, 2000.
[59] Y. Li and J. Henkel. A framework for estimation and minimizing energy dissipation of
embedded hw/sw systems. In DAC ’98: Proceedings of the 35th annual Design Automa-
tion Conference, 1998.
[60] Y. Li and W. Wolf. Hardware/software co-synthesis with memory hierarchies. In ICCAD
’98: Proceedings of the 1998 IEEE/ACM international conference on Computer-aided
design, 1998.
[61] Y. S. Li, S. Malik, and A. Wolfe. Performance estimation of embedded software with in-
struction cache modeling. ACM Trans. Des. Autom. Electron. Syst., 4(3):257–279, 1999.
[62] Y. Liang and T. Mitra. Instruction cache locking using temporal reuse profile. In DAC
’10: Proceedings of the 47th annual Design Automation Conference, 2010.
[63] Y. Liang and T. Mitra. Cache modeling in probabilistic execution time analysis. In DAC
’08: Proceedings of the 45th annual Design Automation Conference, 2008.
[64] Y. Liang and T. Mitra. Static analysis for fast and accurate design space exploration
of caches. In CODES+ISSS ’08: Proceedings of the 6th IEEE/ACM/IFIP international
conference on Hardware/Software codesign and system synthesis, 2008.
[65] S. Lim et al. An accurate worst case timing analysis for risc processors. IEEE Trans.
Softw. Eng., 21(7):593–604, 1995.
[66] T. Liu, M. Li, and C. J. Xue. Minimizing WCET for real-time embedded systems via
static instruction cache locking. In RTAS ’09: Proceedings of the 15th IEEE Real-Time
and Embedded Technology and Applications Symposium, 2009.
[67] P. Lokuciejewski, H. Falk, and P. Marwedel. WCET-driven cache-based procedure po-
sitioning optimizations. In ECRTS ’08: Proceedings of the 2008 Euromicro Conference
on Real-Time Systems, 2008.
[68] R. L. Mattson et al. Evaluation techniques for storage hierarchies. IBM Systems Journal,
9(2), 1970.
[69] P. Mishra, M. Mamidipaka, and N. Dutt. Processor-memory coexploration using an ar-
chitecture description language. ACM Trans. Embed. Comput. Syst., 3(1):140–162, 2004.
[70] J. Montanaro et al. A 160-mhz, 32-b, 0.5-w cmos risc microprocessor. Digital Tech. J.,
9(1), 1997.
[71] F. Mueller. Timing analysis for instruction caches. Real-Time Syst., 18(2-3):217–247,
2000.
[72] N. Nguyen, A. Dominguez, and R. Barua. Memory allocation for embedded systems with
a compile-time-unknown scratch-pad size. In CASES ’05: Proceedings of the 2005 in-
ternational conference on Compilers, architectures and synthesis for embedded systems,
2005.
[73] M. Palesi and T. Givargis. Multi-objective design space exploration using genetic al-
gorithms. In CODES ’02: Proceedings of the tenth international symposium on Hard-
ware/software codesign, 2002.
[74] P. R. Panda, N. D. Dutt, and A. Nicolau. Architectural exploration and optimization of
local memory in embedded systems. In ISSS ’97: Proceedings of the 10th international
symposium on System synthesis, 1997.
[75] P. R. Panda et al. Data and memory optimization techniques for embedded systems. ACM
Trans. Des. Autom. Electron. Syst., 6(2):149–206, 2001.
[76] S. Parameswaran and J. Henkel. I-copes: fast instruction code placement for embedded
systems to improve performance and energy efficiency. In ICCAD ’01: Proceedings of
the 2001 IEEE/ACM international conference on Computer-aided design, 2001.
[77] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hard-
ware/software Interface. Morgan Kaufmann, 1998.
[78] P. Petrov and A. Orailoğlu. Towards effective embedded processors in codesigns:
customizable partitioned caches. In CODES ’01: Proceedings of the ninth international
symposium on Hardware/software codesign, 2001.
[79] K. Pettis and R. C. Hansen. Profile guided code positioning. SIGPLAN Not., 25(6):16–27,
1990.
[80] I. Puaut and D. Decotigny. Low-complexity algorithms for static cache locking in mul-
titasking hard real-time systems. In RTSS ’02: Proceedings of the 23rd IEEE Real-Time
Systems Symposium, 2002.
[81] V. Puranik, T. Mitra, and Y. N. Srikant. Probabilistic modeling of data cache behavior. In
EMSOFT ’09: Proceedings of the seventh ACM international conference on Embedded
software, 2009.
[82] P. Puschner and A. Burns. Guest editorial: A review of worst-case execution-time
analysis. Real-Time Syst., 18(2-3):115–128, 2000.
[83] G. Rajaram and V. Rajaraman. A probabilistic method for calculating hit ratios in direct
mapped caches. Journal of Network and Computer Applications, 19(3), 1996.
[84] J. Robertson and K. Gala. Instruction and data cache locking on the e300 processor core.
Freescale Semiconductor, Inc., 2006.
[85] R. Sen and Y. N. Srikant. WCET estimation for executables in the presence of data
caches. In EMSOFT ’07: Proceedings of the 7th ACM & IEEE international conference
on Embedded software, 2007.
[86] W. Shiue and C. Chakrabarti. Memory exploration for low power, embedded systems. In
DAC ’99: Proceedings of the 36th annual Design Automation Conference, 1999.
[87] W. Shiue, S. Udayanarayanan, and C. Chakrabarti. Data memory design and exploration
for low-power embedded systems. ACM Trans. Des. Autom. Electron. Syst., 6(4):553–
568, 2001.
[88] A. Shrivastava, I. Issenin, and N. Dutt. Compilation techniques for energy reduction in
horizontally partitioned cache architectures. In CASES ’05: Proceedings of the 2005
international conference on Compilers, architectures and synthesis for embedded systems,
2005.
[89] S. J. E. Wilton and N. P. Jouppi. Cacti: An enhanced cache access and cycle time model.
IEEE Journal of Solid-State Circuits, 31:677–688, 1996.
[90] R. A. Sugumar and S. G. Abraham. Set-associative cache simulation using generalized
binomial trees. ACM Transactions on Computer Systems, 13(1), 1995.
[91] V. Suhendra et al. WCET centric data allocation to scratchpad memory. In RTSS ’05:
Proceedings of the 26th IEEE International Real-Time Systems Symposium, 2005.
[92] V. Suhendra and T. Mitra. Exploring locking & partitioning for predictable shared caches
on multi-cores. In DAC ’08: Proceedings of the 45th annual Design Automation Confer-
ence, 2008.
[93] L. Thiele and R. Wilhelm. Design for timing predictability. Real-Time Syst., 28(2-3):157–
177, 2004.
[94] H. Tomiyama and H. Yasuura. Code placement techniques for cache miss rate reduction.
ACM Trans. Des. Autom. Electron. Syst., 2(4):410–429, 1997.
[95] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: a survey. ACM Comput.
Surv., 29(2):128–170, 1997.
[96] A. V. Veidenbaum et al. Adapting cache line size to application behavior. In ICS ’99:
Proceedings of the 13th international conference on Supercomputing, 1999.
[97] X. Vera, B. Lisper, and J. Xue. Data cache locking for higher program predictability. In
SIGMETRICS ’03: Proceedings of the 2003 ACM SIGMETRICS international confer-
ence on Measurement and modeling of computer systems, 2003.
[98] X. Vera, B. Lisper, and J. Xue. Data caches in multitasking hard real-time systems. In
RTSS ’03: Proceedings of the 24th IEEE International Real-Time Systems Symposium,
2003.
[99] P. Viana et al. Configurable cache subsetting for fast cache tuning. In DAC ’06: Proceed-
ings of the 43rd annual Design Automation Conference, 2006.
[100] W. H. Wang and J. L. Baer. Efficient trace-driven simulation methods for cache perfor-
mance analysis. ACM Trans. Comput. Syst., 9(3):222–241, 1991.
[101] R. T. White et al. Timing analysis for data caches and set-associative caches. In RTAS
’97: Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium
(RTAS ’97), 1997.
[102] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In PLDI ’91: Pro-
ceedings of the ACM SIGPLAN 1991 conference on Programming language design and
implementation, 1991.
[103] Z. Wu and W. Wolf. Iterative cache simulation of embedded CPUs with trace stripping. In
CODES ’99: Proceedings of the seventh international workshop on Hardware/software
codesign, 1999.
[104] L. Xue, O. Ozcan, and K. Mahmut. In DAC ’07: Proceedings of the 44th annual Design
Automation Conference, 2007.
[105] H. Yang et al. Improving power efficiency with compiler-assisted cache replacement. J.
Embedded Comput., 1(4):487–499, 2005.
[106] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded
systems. SIGARCH Comput. Archit. News, 31(2), 2003.
[107] E. Zitzler, K. Deb, and L. Thiele. Comparison of multiobjective evolutionary algorithms:
Empirical results. Evol. Comput., 8(2):173–195, 2000.
