Adaptive caching for high-performance memory systems by Qureshi, Moinuddin Khalil Ahmed, 1978-
Copyright
by
Moinuddin Khalil Ahmed Qureshi
2007
The Dissertation Committee for Moinuddin Khalil Ahmed Qureshi
certifies that this is the approved version of the following dissertation:
Adaptive Caching for High-Performance Memory Systems
Committee:






Adaptive Caching for High-Performance Memory Systems
by
Moinuddin Khalil Ahmed Qureshi, B.E.; M.S.
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
August 2007
Dedicated to Abba and Ammi.
Acknowledgments
This thesis is dedicated to my loving parents, Khalil Ahmed Qureshi and Hasina
Qureshi, both of whom passed away during the course of my studies. Words are incapable
of describing my feeling of gratitude for them. The value my father placed on education,
even though he had limited access to it, is the single biggest reason in my completing
this PhD. My mother was a constant source of love and encouragement. The strength and
courage I derive from their memories has kept me going forward.
I am grateful for the patience, love, and unconditional support of my siblings: Shaheen,
Mona, Amina, and Alauddin. I am especially grateful to my sister Mona for always
being there for me when I needed support. Her unwavering faith in me and her taking care
of things at home at the most delicate times allowed me to do my studies in the US.
I am indebted to my adviser, Yale Patt, for his influence on my life with both his
teaching and guidance. His EE360N is responsible for much of what I know in computer
architecture. His EE382N motivated me to pursue a PhD. Yale provided the right balance
of freedom and guidance needed for my development as a researcher and as an individual.
His teachings and principles will continue to influence my life for a long time.
My life in graduate school would have been barren had it not been for two members
of the Monga family: Vishal Monga and Archna Monga. More than friends, they became my
family away from home. Vishal was my roommate for the first four years. He helped me
focus during the most difficult times and made my dark days seem brighter. I was deeply
touched by his friendship, patience, and maturity. During the last year, I had the good
fortune of knowing Archna. Archna taught me that it is possible to balance work and life.
I am grateful for her friendship, understanding, discussions, and yummy parathas. I also
thank other elderly members of the Monga family for their affection and blessing.
Members of the HPS research group provided a creative and helpful environment
for my studies. I thank them all. Dave Thompson for helping me with writing and
presentation during the initial years. Francis for his helpfulness, sense of humor, and letting
me steal his pens. Onur for his friendship and feedback on research. Hyesoon Kim for her
comradeship and accompanying me for coffee even though she just had one. Aater
Suleman for “extreme writing”, critiquing my slides, and joining me for Friday prayers. Danny
Lynch and Santosh Srinath for providing mental breaks. Paul Racunas, Robert Chappell,
and Mary Brown for their mentorship. Veynu Narasiman, Jose Joao, Chang Joo Lee,
Rustam Miftakhutdinov and Linda Hastings for their friendship and proof reading my papers.
I thank Jacob Abraham, Derek Chiou, Philip Emma, Sanjay Patel, and Emmett
Witchel for their time to serve on my dissertation committee. Derek was always available
for discussions and feedback on my research. Sanjay gave useful advice throughout.
My learning in graduate school was enriched by the internships at IBM and Intel. I
thank Tom Puzak for his caring nature, friendship, and mentorship. Paul Racunas for the
brainstorming sessions and for arranging several hikes to White Mountains. Brian Prasky
for making me appreciate the tight constraints that designers operate with. Pradip Bose for
discussions and guidance. Chris Wilkerson, Andy Glew, and Simon Steely for improving
my understanding of caching. And, Joel Emer for his helpful nature, discussions, and
useful advice. I am honored to co-author an ISCA paper with Simon and Joel.
Special thanks to Aamer Jaleel for his friendship, cheerfulness, and helpful nature.
His tolerance for my sense of humor is commendable. His candid reviews increased the
quality of my conference submissions. It was a real pleasure working with him on the
ISCA paper. Thanks to Kais Majid for his friendship and encouraging me to pursue studies
in the US, Mrs. Sharmila Petkar for insisting that I take the GRE exam, Melanie Gulick for always
knowing how the ECE department works, and Prabhat Jha, Suju Rajan, and Arindam
Banerjee for their friendship. Last but not the least, I thank IBM for the PhD fellowship.
Adaptive Caching for High-Performance Memory Systems
Publication No.
Moinuddin Khalil Ahmed Qureshi, Ph.D.
The University of Texas at Austin, 2007
Supervisor: Yale N. Patt
One of the major limiters to computer system performance has been the access to
main memory, which is typically two orders of magnitude slower than the processor. To
bridge this gap, modern processors already devote more than half of the on-chip transistors
to the last-level cache. However, traditional cache designs – developed for small first-level
caches – are inefficient for large caches. Therefore, cache misses are common which results
in frequent memory accesses and reduced processor performance. The importance of cache
management has become even more critical because of the increasing memory latency, in-
creasing working sets of many emerging applications, and decreasing size of cache devoted
to each core due to increased number of cores on a single chip. This dissertation focuses on
analyzing some of the problems with managing large caches and proposing cost-effective
solutions to improve their performance.
Different workloads and program phases have different locality characteristics that
make them better suited to different replacement policies. This dissertation proposes a hybrid
replacement policy that can select from multiple replacement policies depending on which
policy has the highest performance. To implement hybrid replacement with low-overhead,
it shows that cache behavior can be approximated by sampling a few sets and proposes the
concept of Dynamic Set Sampling.
The commonly used LRU replacement policy results in thrashing for memory-
intensive workloads that have a working set bigger than the cache size. This dissertation
shows that performance of memory-intensive workloads can be improved significantly by
changing the recency position where the incoming line is inserted. The proposed mecha-
nism reduces cache misses by 21% over LRU, is robust across a wide variety of workloads,
incurs a total storage overhead of less than two bytes, and does not change the existing
cache structure.
Modern systems try to service multiple cache misses in parallel. The variation in
Memory Level Parallelism (MLP) causes some misses to be more costly on performance
than other misses. This dissertation presents the first study on MLP-aware cache
replacement and proposes to improve performance by eliminating some of the performance-critical
isolated misses.
Finally, this dissertation also analyzes cache partitioning policies for shared caches
in chip multi-processors. Traditional partitioning policies either divide the cache equally
among all applications or use the LRU policy to do a demand based cache partitioning.
This dissertation shows that performance can be improved if the shared cache is partitioned
based on how much the application benefits from the cache, rather than on its demand for
the cache. It proposes a novel low-overhead circuit that can dynamically monitor the utility
of cache for any application. The proposed partitioning improves weighted-speedup by





List of Tables
List of Figures
Chapter 1. Introduction
1.1 The Problem
1.2 Thesis Statement
1.3 Contributions
1.4 Dissertation Organization
Chapter 2. Related Work
2.1 Caches: Background and Terminology
2.2 Related Work in Cache Organization
2.2.1 Reducing Conflict Misses
2.2.2 Reducing Capacity Misses
2.3 Related Work in Improving Cache Management
2.3.1 Improving Cache Replacement
2.3.2 Related Work in Cache Bypassing
2.3.3 Related Work in Early Eviction
2.3.4 Cost-Sensitive Cache Management
2.4 Reducing Misses with Prefetching
2.5 Servicing Demand Misses in Parallel
Chapter 3. Hybrid Replacement via Dynamic Set Sampling
3.1 Motivation
3.2 Experimental Methodology
3.2.1 Configuration
3.2.2 Benchmarks
3.3 Hybrid Replacement
3.3.1 Tournament Selection of Replacement Policy
3.3.2 Results for Tournament Selection
3.4 Dynamic Set Sampling
3.4.1 Analytical Model for Dynamic Set Sampling
3.5 Sampling Based Adaptive Replacement
3.5.1 Leader Set Selection Mechanism
3.5.2 Hardware Cost of SBAR
3.6 Results
3.6.1 Comparison of TSEL-global and SBAR
3.6.2 Effect of Number of Leader Sets on SBAR
3.6.3 SBAR selection between LRU and Random replacement
3.7 Summary
Chapter 4. Adaptive Insertion Policies
4.1 Introduction
4.2 Motivation
4.3 Static Insertion Policies
4.3.1 Analysis with Cyclic Reference Model
4.3.2 Case Studies of Memory-Intensive Thrashing Workloads
4.3.2.1 The mcf benchmark:
4.3.2.2 The art benchmark:
4.3.2.3 The health benchmark:
4.3.3 Case Study of a Memory-Intensive LRU-Friendly Workload
4.3.4 Results
4.4 Dynamic Insertion Policy
4.4.1 The DIP-Global Mechanism
4.4.2 The DIP-IDSS Mechanism
4.4.3 Analytical Model for IDSS
4.4.4 Dedicated Set Selection Policy
4.4.5 Results
4.4.6 Dynamic Adaptation of DIP to Application Behavior
4.5 Analysis
4.5.1 Varying the Cache Size
4.5.2 Bypassing Instead of Inserting at LRU Position
4.5.3 Impact on System Performance
4.5.4 Estimation of Hardware Overhead and Design Changes
4.5.5 Interaction with Prefetching
4.6 Related Work
4.6.1 Alternative Cache Replacement Policies
4.6.2 Related Work in Hybrid Replacement
4.6.3 Related Work in Paging Domain
4.6.4 Related Work in Cache Bypassing and Early Eviction
4.6.5 Related Work in Prefetching
4.7 Summary
Chapter 5. MLP-Aware Cache Replacement
5.1 Introduction
5.1.1 Not All Misses are Created Equal
5.1.2 Contributions
5.2 Background
5.3 Computing MLP-Based Cost
5.3.1 Algorithm
5.3.2 Distribution of mlp-cost
5.3.3 Predictability of the mlp-cost metric
5.4 The Design of an MLP-Aware Cache Replacement Scheme
5.4.1 The Linear (LIN) Policy
5.4.2 Results for the LIN Policy
5.5 Cost-Sensitive Hybrid Replacement
5.5.1 Cost-Sensitive Tournament Selection of Replacement Policy
5.5.2 Sampling Based Adaptive Replacement
5.5.3 Results for the SBAR Mechanism
5.5.4 Effect of Leader Set Selection Policies and Different Number of Leader Sets
5.6 Analysis
5.6.1 Ammp: A Case Study for Dynamic Adaptation of SBAR
5.6.2 Hardware Cost of MLP-Aware Replacement
5.6.3 MLP-Aware Replacement using Existing Cost-Sensitive Replacement Policy
5.7 Summary
Chapter 6. Utility Based Partitioning of Shared Caches
6.1 Introduction
6.2 Motivation and Background
6.3 Utility-Based Cache Partitioning
6.3.1 Framework
6.3.2 Utility Monitors (UMON)
6.3.3 Reducing Storage Overhead Using DSS
6.3.4 Analytical Model for Dynamic Set Sampling
6.3.5 Partitioning Algorithm
6.3.6 Changes to Replacement Policy
6.4 Experimental Methodology
6.4.1 Multicore System Configuration
6.4.2 Multicore Performance Metrics
6.4.3 Multi-programmed Workloads
6.5 Results and Analysis
6.5.1 Performance on Weighted Speedup Metric
6.5.2 Performance on Throughput Metric
6.5.3 Evaluation on Fairness Metric
6.5.4 Phase-Based Adaptation of UCP
6.5.5 Effect of Varying the Number of Sampled Sets
6.5.6 Hardware Overhead of UCP
6.6 Scalable Partitioning Algorithm
6.6.1 Background
6.6.2 The Lookahead Algorithm
6.6.3 Result for Partitioning Algorithms
6.7 Related Work
6.7.1 Related Work in Cache Partitioning
6.7.2 Related Work in Cache Organization
6.7.3 Related Work in Memory Allocation
6.7.4 Related work in SMT
6.8 Summary
Chapter 7. Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
7.2.1 Applications of Dynamic Set Sampling
7.2.2 Region-Aware Cache Management
7.2.3 Prefetching-Aware Cache Management
7.2.4 MLP-Aware Microarchitecture and Memory System
7.2.5 Extensions of Cache Partitioning
Appendix
Appendix 1. Proposed Techniques on Remaining SPEC Benchmarks
1.1 Hybrid Replacement via Dynamic Set Sampling
1.2 Adaptive Insertion Policies





3.1 Baseline system configuration
3.2 Benchmark summary
3.3 Storage overhead of SBAR.
4.1 Hit Rate for LRU, OPT, LIP, and BIP under Cyclic Reference Model
4.2 Comparison of Replacement Policies
5.1 Repeatability of mlp-cost
5.2 Quantization of mlp-cost
6.1 Multicore System Configuration.
6.2 Multi-programmed Workload Summary
6.3 Storage Overhead of a UMON circuit with 32 Sets
1.1 Compulsory misses for the remaining SPEC benchmarks
1.2 MPKI with Hybrid Replacement on Remaining SPEC Benchmarks
1.3 MPKI with LRU and DIP on Remaining SPEC Benchmarks
1.4 IPC with LRU, LIN, and SBAR on Remaining SPEC Benchmarks
List of Figures
3.1 Comparison of replacement policies: LRU and LFU
3.2 Tournament selection of replacement policies for a single set.
3.3 TSEL-global mechanism
3.4 Comparison of replacement policies: LRU, LFU, and Selecting between LRU and LFU using TSEL-global.
3.5 Reducing ATD overhead via Dynamic Set Sampling.
3.6 Analytical Bounds on Number of Leader Sets.
3.7 Sampling Based Adaptive Replacement
3.8 Comparison of TSEL-global and SBAR.
3.9 Effect of Number of Leader Sets on SBAR.
3.10 Comparisons of LRU, Random, and SBAR (LRU+RND)
4.1 Percentage of Zero Reuse Lines for the Baseline 1MB 16-way L2 cache
4.2 Miss-causing instructions from the mcf benchmark
4.3 MPKI vs. cache size for mcf
4.4 Miss-causing instructions from the art benchmark
4.5 MPKI vs. cache size for art
4.6 Miss-causing instruction from the health benchmark
4.7 MPKI vs. cache size for health
4.8 MPKI vs. cache size for swim
4.9 Comparison of Static Insertion Policies
4.10 Implementations of Dynamic Insertion Policy
4.11 P(Best) from Gaussian Curve
4.12 Analytical Bounds for IDSS
4.13 Comparison of Dynamic Insertion Policies
4.14 Dynamic adaptation of DIP to program behavior
4.15 Comparison of LRU and DIP for different cache size
4.16 Effect of Bypassing on DIP
4.17 IPC improvement with DIP
4.18 Hardware changes for implementing DIP
4.19 Interaction of Insertion Policy with Prefetching
5.1 The drawback of not including MLP information in replacement decisions.
5.2 Distribution of mlp-cost for baseline processor
5.3 Microarchitecture for MLP-aware cache replacement
5.4 IPC improvement with LIN (λ) as λ is varied.
5.5 Distribution of mlp-cost for baseline and LIN.
5.6 Cost-sensitive Tournament Selection for a single set.
5.7 Cost-sensitive SBAR selection between LIN and LRU
5.8 IPC improvement with the SBAR mechanism.
5.9 Performance impact of SBAR for different leader set selection policies and different number of leader sets.
5.10 Comparison of LRU, LIN, and SBAR for the ammp benchmark
5.11 MLP-aware replacement using different cost-sensitive policies.
6.1 A Case for Utility Based Cache Partitioning
6.2 MPKI and CPI for Low Utility Benchmarks.
6.3 MPKI and CPI for High Utility Benchmarks.
6.4 MPKI and CPI for Saturating Utility Benchmarks.
6.5 Framework for Utility-Based Cache Partitioning
6.6 Tracking utility information using stack property
6.7 Utility Monitors
6.8 Bounds on Number of Sampled Sets
6.9 Performance of LRU, Half-and-Half, and UCP.
6.10 LRU (left bar) vs. UCP (right bar) on throughput metric.
6.11 LRU, Half-and-Half, and UCP on fairness metric.
6.12 UCP vs. Static Partitioning
6.13 Effect of Number of Sampled Sets on UCP.
6.14 Benchmarks with non-convex utility curves
6.15 Comparison of Partitioning Algorithms





Over the past two decades, processor speeds have increased at a much faster rate
than DRAM speeds. Consequently, the number of processor cycles it takes to access main
memory has also increased. Current high performance processors have memory access
latency of well over 300 cycles, and trends [84] indicate that this number will only
increase in the future. The growing disparity between processor speed and memory speed is
popularly referred to in the architecture community as the Memory Wall [86]. Main memory
accesses affect processor performance adversely. Therefore, current processors use caches
to reduce the number of memory accesses. A cache hit provides fast access to recently
accessed data. However, if there is a cache miss at the last level cache, a memory access is
initiated and the processor is stalled for hundreds of cycles [84][34]. Therefore, to sustain
high performance, it is important to reduce cache misses.
The design of the first level cache is heavily constrained by access time. Furthermore,
with out-of-order execution, current processors are able to tolerate some of the first
level cache misses [35]. The stringent requirement of fast access time and the limited
potential for improvement because of out-of-order execution have led to simpler designs for
the first level cache. On the other hand, the design of the second level cache¹ is constrained
more by the available on-chip transistors and less by the access time to the cache. Moreover,
the locality characteristics of the second level cache access stream are different from the
first level cache access stream. However, current processors use traditional management
policies for the second level cache without paying attention to the locality characteristics
of the access stream visible to the second level cache. With traditional designs, the second
level cache is not used efficiently, which leads to a large number of cache misses and lower
performance. Performance can be improved with designs that are able to better exploit the
locality characteristics and the design constraints visible to the second level cache.
¹ For simplicity, we assume a two level cache hierarchy throughout this proposal. However, the problems,
discussion, and solutions can easily be extended to all the non-primary caches in a multi-level cache hierarchy.
Different workloads and program phases have different locality characteristics that
make them better suited to different replacement policies. However, traditional cache
designs decide the replacement policy at design time and use that policy for all applications
and phases. Cache performance can be improved if the cache management can select from
multiple replacement policies depending on which policy performs better for the given
application or phase. This dissertation provides a practical framework to implement hybrid
replacement that can choose the best performing policy at runtime.
Implementing hybrid replacement in a straightforward manner requires tracking
the information for competing replacement policies on a per-set basis using extra tags. The
hardware overhead for extra tags for all sets can be prohibitively expensive. Small caches
have a few tens of sets; however, large caches typically have hundreds or thousands of sets.
The hardware overhead for tracking information about replacement policies can be reduced
by using the key insight that cache behavior can be approximated with high accuracy by
sampling a few sets. This mechanism, called Dynamic Set Sampling, enables
cost-effective optimization for several caching policies.
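The sampling insight can be illustrated with a small software model. The following is a minimal sketch under stated assumptions, not the hardware mechanism developed in later chapters: the function names, the 64-set cache, the 4-way associativity, and the two synthetic per-set access patterns are all hypothetical and chosen only for illustration. The sketch counts misses under LRU and LFU for each set and compares the policy chosen from 8 sampled sets with the policy chosen from all 64 sets.

```python
from collections import OrderedDict

def misses_lru(trace, ways):
    """Misses of one cache set under LRU replacement."""
    lines = OrderedDict()  # ordered from LRU (first) to MRU (last)
    misses = 0
    for tag in trace:
        if tag in lines:
            lines.move_to_end(tag)  # promote to MRU on a hit
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the LRU line
            lines[tag] = True
    return misses

def misses_lfu(trace, ways):
    """Misses of one cache set under LFU replacement (ties evict the oldest line)."""
    counts = {}  # tag -> access count; a dict keeps insertion order
    misses = 0
    for tag in trace:
        if tag in counts:
            counts[tag] += 1
        else:
            misses += 1
            if len(counts) >= ways:
                del counts[min(counts, key=counts.get)]
            counts[tag] = 1
    return misses

def scan_trace(iterations):
    """Hypothetical pattern: two hot lines mixed with never-reused lines."""
    trace, next_tag = [], 100
    for _ in range(iterations):
        trace += [0, 1, 0, 1, next_tag, next_tag + 1, next_tag + 2]
        next_tag += 3
    return trace

def phase_trace(rounds):
    """Hypothetical pattern: one working set followed by a different one."""
    return [0, 1, 2, 3] * rounds + [4, 5, 6, 7] * rounds

def build_traces(num_sets=64):
    # Assumed workload: every fourth set sees the phase pattern.
    return [phase_trace(10) if s % 4 == 0 else scan_trace(20)
            for s in range(num_sets)]

def choose_policy(set_traces, ways, sets_to_observe):
    """Pick the policy with fewer misses, looking only at the given sets."""
    lru = sum(misses_lru(set_traces[s], ways) for s in sets_to_observe)
    lfu = sum(misses_lfu(set_traces[s], ways) for s in sets_to_observe)
    return "LRU" if lru <= lfu else "LFU"

traces = build_traces()
sampled = range(0, 64, 9)  # 8 of the 64 sets
print(choose_policy(traces, 4, range(64)), choose_policy(traces, 4, sampled))
```

For this synthetic input the decision from the 8 sampled sets agrees with the decision from all 64 sets. The sample must still be representative: with a stride of 8, every sampled set in this example would see the phase pattern and the sampled decision would differ from the full-cache decision.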
The commonly used LRU replacement policy causes thrashing for memory-intensive
workloads that have a working set greater than the available cache size. In fact, with the
traditional LRU policy, more than 60% of the lines installed in the second-level cache remain
unused between insertion and eviction. Thus, most of the inserted lines occupy cache space
without ever contributing to cache hits. When the working set is larger than the available
cache size, cache performance can be improved by retaining some fraction of the working
set long enough that at least that fraction of the working set contributes to cache hits.
This dissertation shows that performance of memory-intensive workloads can be improved
significantly by changing the recency position where the incoming line is inserted.
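The effect of the insertion position can be seen on a single cache set with a cyclic access pattern. The sketch below is illustrative only: the function name, the 4-way set, and the 6-line working set are assumptions, and the model ignores everything except the recency stack. It compares inserting the incoming line at the MRU position (traditional LRU) with inserting it at the LRU position.

```python
def hits_cyclic(working_set, ways, rounds, insert_at_mru):
    """Hits in one set when tags 0..working_set-1 are accessed cyclically."""
    stack = []  # recency stack: index 0 is MRU, the last entry is LRU
    hits = 0
    for _ in range(rounds):
        for tag in range(working_set):
            if tag in stack:
                hits += 1
                stack.remove(tag)
                stack.insert(0, tag)  # promote to MRU on a hit
            else:
                if len(stack) == ways:
                    stack.pop()  # evict the line in the LRU position
                if insert_at_mru:
                    stack.insert(0, tag)  # traditional LRU insertion
                else:
                    stack.append(tag)  # insert at the LRU position

    return hits

print(hits_cyclic(6, 4, 10, insert_at_mru=True))
print(hits_cyclic(6, 4, 10, insert_at_mru=False))
```

With MRU insertion every line is evicted before it is reused, so the set never hits. With LRU-position insertion the incoming line only displaces the previously inserted line, so three of the four ways keep part of the working set and hit in every round after the first.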
Performance loss due to long-latency memory accesses can be reduced by servicing
multiple memory accesses concurrently. The notion of generating and servicing
long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is
not uniform across cache misses – some misses occur in isolation while some occur in
parallel with other misses. Isolated misses are more costly on performance than parallel
misses. Unfortunately, traditional cache replacement algorithms are not aware of the
disparity in performance loss that results from the variation in MLP among cache misses.
Cache replacement, if made MLP-aware, can improve performance by reducing the number
of performance-critical isolated misses. This dissertation proposes a framework for
MLP-aware cache replacement by using a run-time technique to compute the MLP-based
cost for each cache miss. It then describes a simple cache replacement mechanism that
takes both MLP-based cost and recency into account.
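One way to make the difference between isolated and parallel misses concrete is to divide each stall cycle equally among the misses outstanding in that cycle. The sketch below assumes this accounting for illustration; the text above does not spell out the cost computation, and the function name, the miss labels, and the cycle numbers are hypothetical.

```python
def mlp_costs(misses):
    """Assumed MLP-based cost: each cycle adds 1/N to every outstanding miss,
    where N is the number of misses outstanding in that cycle.

    misses maps a label to (start_cycle, end_cycle), with end_cycle exclusive.
    """
    costs = {name: 0.0 for name in misses}
    first = min(start for start, _ in misses.values())
    last = max(end for _, end in misses.values())
    for cycle in range(first, last):
        outstanding = [name for name, (start, end) in misses.items()
                       if start <= cycle < end]
        for name in outstanding:
            costs[name] += 1.0 / len(outstanding)
    return costs

# Hypothetical timeline: A is isolated, B and C fully overlap each other.
example = {"A": (0, 400), "B": (1000, 1400), "C": (1000, 1400)}
print(mlp_costs(example))
```

Under this accounting the isolated miss carries its full 400-cycle latency, while each of the two overlapping misses carries half of the shared 400 cycles. A replacement policy that sees these costs has a reason to retain the line behind the isolated miss in preference to either parallel one.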
Finally, this dissertation also investigates the problem of partitioning a shared cache
between multiple concurrently executing applications. The commonly used LRU policy
implicitly partitions a shared cache on a demand basis, giving more cache resources to the
application that has a high demand and fewer cache resources to the application that has a
low demand. However, a higher demand for cache resources does not always correlate with
a higher performance from additional cache resources. It is beneficial for performance to
invest cache resources in the application that benefits more from the cache resources rather
than in the application that has more demand for the cache resources. This dissertation
proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that
partitions a shared cache between multiple applications depending on the reduction in cache
misses that each application is likely to obtain for a given amount of cache resources. The
proposed mechanism monitors each application at runtime using a novel, cost-effective,
hardware circuit that requires less than 2kB of storage. The information collected by the
monitoring circuits is used by a partitioning algorithm to decide the amount of cache
resources allocated to each application.
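The idea of partitioning by benefit rather than by demand can be sketched for two applications sharing the ways of a cache. The following is a simplified illustration, not the partitioning algorithm of Chapter 6: the function name is hypothetical, the miss curves are made-up numbers, and the search simply tries every split that gives each application at least one way.

```python
def best_partition(misses_a, misses_b, total_ways):
    """Return (ways_a, ways_b, total_misses) minimizing the combined misses.

    misses_x[w] is the number of misses of application x when it owns w ways.
    """
    best = None
    for ways_a in range(1, total_ways):
        ways_b = total_ways - ways_a
        total = misses_a[ways_a] + misses_b[ways_b]
        if best is None or total < best[2]:
            best = (ways_a, ways_b, total)
    return best

# Made-up miss curves for an 8-way shared cache (index = ways owned).
# Application A saves 100 misses per extra way; application B streams,
# so it has high demand but gains almost nothing beyond one way.
misses_a = [800, 700, 600, 500, 400, 300, 200, 100, 0]
misses_b = [900, 890, 890, 890, 890, 890, 890, 890, 890]
print(best_partition(misses_a, misses_b, 8))
```

For these curves the search gives seven ways to application A and one way to application B, even though B issues more misses. Exhaustive search is only practical for a small number of applications, which is why a lower-complexity algorithm matters as the number of cores grows.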
1.2 Thesis Statement
As locality characteristics and design constraints of large caches are different from
first-level caches, traditional cache designs – developed for small first-level caches – are
inefficient for large caches. Simple and cost-effective changes to cache management can
substantially improve the performance of large caches.
1.3 Contributions
This dissertation makes the following contributions:
1. This dissertation presents a hybrid replacement policy that can select from multiple
replacement policies depending on which policy has the highest performance. To
implement hybrid replacement with low overhead, it shows that cache behavior can
be approximated by sampling a few sets and proposes the concept of Dynamic Set
Sampling.
2. This dissertation shows that performance of memory-intensive workloads can be
improved significantly by changing the recency position where the incoming line is
inserted. The proposed mechanism reduces cache misses by 21% over LRU, is
robust across a wide variety of workloads, incurs a total storage overhead of less than
two bytes, and does not change the existing cache structure.
3. This dissertation presents the first study on MLP-aware cache replacement and pro-
poses to improve performance by eliminating some of the performance-critical iso-
lated misses. It describes a hardware mechanism to measure MLP-based cost at
runtime and uses this cost to drive a cost-sensitive replacement policy.
4. This dissertation shows that performance of shared caches can be improved if the
shared cache is partitioned based on how much the competing application benefits
from the cache, rather than on its demand for the cache. It proposes a novel low-
overhead circuit that can dynamically monitor the utility of cache for any application.
The proposed partitioning improves weighted-speedup by 11%, throughput by 17%
and fairness by 11% on average compared to LRU. As optimal partitioning is NP-hard,
this dissertation also proposes a low time-complexity algorithm that is scalable
to many cores and performs similarly to searching through all the exponential number
of possible partitions.
1.4 Dissertation Organization
This dissertation is divided into seven chapters. Related work is discussed in Chapter 2.
Chapter 3 describes cost-effective hybrid replacement policies. Chapter 4 analyzes
insertion policies for high performance caching. MLP-aware cache replacement is proposed
in Chapter 5. Chapter 6 discusses utility-based partitioning of shared caches. Finally,




Addressing the memory wall problem has been a hot topic of research in the computer
architecture community for the past several years. Consequently, there have been several
proposals for reducing the penalty of memory accesses. This chapter describes some
of the work that relates to the memory wall in general and cache management in particular.
Related work for the specific problems of cache management studied in this dissertation is
discussed in detail in the corresponding chapters so that qualitative and quantitative com-
parison can be made with the proposed techniques.
2.1 Caches: Background and Terminology
Caching is one of the most fundamental concepts in computing. Processor caches
were introduced as early as the mid sixties [83] to bridge the difference in speed between
the processor and memory. Smith [71] did the initial study on the performance of cache memory
as a function of the three basic parameters of the cache: size, associativity, and line size.
Hill [26] classified cache misses into the popular 3C model: compulsory misses, conflict
misses, and capacity misses. Compulsory misses correspond to the number of cache lines
in the trace, conflict misses are the misses that would be reduced by increasing the associativity
of the cache to fully associative, and the remaining misses are capacity misses. This
model, however, does not take into account the variation in misses caused by changing
the replacement policy. Puzak [61] analyzed cache replacement algorithms for processor
caches and proposed the shadow directory mechanism for improving cache replacement.
2.2 Related Work in Cache Organization
During the initial years, on-chip caches were small and determined the cycle time of
the processor. To keep the access time small, the cache was typically configured as a direct-mapped
structure. A substantial body of research has investigated caching optimizations to
reduce conflict misses. Recently, researchers have also started to focus on cache organizations
that increase the effective capacity of the cache. This section describes some of the
cache organization related work that focuses on reducing conflict and capacity misses.
2.2.1 Reducing Conflict Misses
Memory accesses in general purpose applications are non-uniformly distributed
across the sets in a cache [57] [37]. This non-uniformity creates a heavy demand on some
sets, which can lead to conflict misses, while other sets remain underutilized. Substantial
research effort has been put forth to address this problem for direct-mapped caches. Victim
caches [33] are small, fully-associative buffers that provide limited additional associativity
for heavily utilized entries in a direct-mapped cache. The hash-rehash cache [1], the adap-
tive group-associative cache [57], and the predictive sequential-associative cache [8] trade
variable hit latency for increased associativity. With these schemes, if the first attempt to
access the cache results in a miss, the hash function that maps addresses to sets is changed,
and a new cache access is initiated. This process may be repeated multiple times until either
the data is found or a miss is detected. These techniques were proposed for first-level
direct-mapped caches, and their effectiveness reduces as associativity increases due to the
inherent performance benefit of increased associativity.
Reducing conflict misses for large secondary caches has also been studied in the
literature. Hallnor et al. [25] proposed the Indirect Index Cache (IIC) as a mechanism to
achieve full associativity through software management. Qureshi et al. [64] proposed the
Variable way set associative (V-Way) cache to achieve the global replacement benefits of a
fully associative cache while maintaining the constant hit latency of a set-associative cache.
2.2.2 Reducing Capacity Misses
Capacity misses can be reduced by increasing the cache size. Several studies [87][4]
have looked at compression techniques for increasing the effective capacity of the cache.
The key idea in all the proposed compression schemes is that some values occur much
more frequently than others and hence can be stored in few bits. It is desirable that the
compression scheme has low compression and decompression latency. If the performance
loss from decompression is more than the capacity benefits obtained from compression,
then compression can reduce performance. Adaptive cache compression was proposed by
Alameldeen et al. [3] to perform compression only if it is likely to improve performance.
An orthogonal approach to increase cache capacity is to filter unused words in the cache
lines. The recently proposed Line Distillation technique [62] filters unused words in the
cache lines once the lines have crossed a pre-defined position in the LRU stack. There are
several predictor-based techniques [31][40][11][60] that provide spatial filtering.
2.3 Related Work in Improving Cache Management
2.3.1 Improving Cache Replacement
Current caches typically use either LRU or some approximation of LRU [73] as
the replacement policy. An ideal replacement scheme can minimize the number of misses
by choosing the victim that will be accessed the farthest in the future [7]. Although such
a scheme is impossible to build, it shows that there is significant room for improvement
over the LRU replacement policy. Several proposals [50] [6] [70] [42] [54] have looked at
improving cache replacement by taking into account both recency as well as frequency.
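To make the ideal policy concrete, the following Python sketch picks the victim whose next reference lies farthest in the future. It is only an illustration and assumes that the entire future reference stream is available to an oracle, which is precisely why such a scheme cannot be built in hardware. The names opt_victim, count_misses_opt, resident, and future are hypothetical and are not taken from [7].

```python
def opt_victim(resident, future):
    """Return the resident line whose next use is farthest in the future.

    resident: lines currently held in the set (hypothetical representation).
    future:   the upcoming reference stream, known only to an oracle.
    """
    def next_use(line):
        try:
            return future.index(line)   # distance to the next reference
        except ValueError:
            return float("inf")         # never referenced again

    return max(resident, key=next_use)


def count_misses_opt(trace, ways):
    """Count misses of one fully associative set under the oracle policy."""
    resident, misses = [], 0
    for pos, line in enumerate(trace):
        if line in resident:
            continue                    # hit: nothing to replace
        misses += 1
        if len(resident) == ways:
            resident.remove(opt_victim(resident, trace[pos + 1:]))
        resident.append(line)
    return misses
```

For the short trace [0, 1, 2, 0, 1, 2] and a two-way set, count_misses_opt reports four misses, whereas LRU would miss on all six references.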
2.3.2 Related Work in Cache Bypassing
It is not beneficial to install a line that is never referenced while it is in the cache.
McFarling [47] proposed dynamic exclusion to reduce conflict misses in a direct-mapped
instruction cache. However, the proposed scheme is not easily applicable to data caches.
Gonzalez et al. [24] proposed using a locality prediction table to bypass access patterns
that are likely to pollute the cache. Their technique works well only for predictable access
patterns, which are usually found in numeric code. Tyson et al. [81] looked at static and
dynamic techniques to mark the load instructions as Cacheable/Not Allocatable (CNA).
Data references generated by CNA instructions are not allocated in the cache.
Rivers and Davidson [66] looked at reducing the conflict misses in a direct-mapped cache
by bypassing lines with low temporal locality into a small fully associative cache. Johnson
[31] describes a technique to track the reuse behavior of cache lines by keeping the access
information in a Macro Address Table. If the incoming line is likely to have less reuse than
the line it will evict, then the incoming line is not installed in the cache. Their studies with
direct-mapped caches show that cache bypassing can help performance by reducing both
misses and bus traffic.
2.3.3 Related Work in Early Eviction
Another approach to address the problem of low locality lines is to evict them early.
Wong et al. [85] proposed modified LRU policies for early eviction of lines with low
temporal locality by marking some instructions as low temporal instructions. Wang et al.
[82] looked at compiler techniques to help the replacement engine by tagging cache lines
with Evict-me bits. Another area of research has been to predict the last touch to a cache
line [41] [43]. After the last touch is encountered, the line can either be turned off [36] or
be used to store prefetched data [41].
2.3.4 Cost-Sensitive Cache Management
Another area of research is to make the cache management aware of the variation in
performance impact of cache misses. Srinivasan et al. [75] analyzed the criticality of load
misses for out-of-order processors. Based on the criticality analysis, Srinivasan et al. also
investigated criticality based caches [74] and concluded that the working set of critical loads
is large, and therefore it is better to have locality based caching. Cost-sensitive replacement
for on-chip caches was investigated by Jeong et al. [30]. They proposed variations of
LRU that take cost (any numerical property associated with a cache block) into account.
They evaluated their cost-sensitive policies for Non-Uniform Memory Access (NUMA)
systems by taking the bank access latency (local-hit vs. remote-hit) as the cost parameter.
They showed that significant performance improvements are possible when there is a large
variation in the cost of different cache blocks.
2.4 Reducing Misses with Prefetching
A cache miss can be avoided if the requested line is brought into the cache ahead
of its use. Several proposals have investigated run-time prefetching techniques using spe-
cialized hardware [12] [17] [22] [32] [5] [53]. The central idea of these schemes is to store
information about recent memory accesses that missed the cache and detect a pattern in
the miss address stream. If the compiler can predict the address of the desired data before
it is likely to be referenced, then it can insert software prefetches in the code [59]. Data
prefetching is extremely beneficial when either the access stream has a predictable pattern or
the compiler can accurately predict the addresses of the data ahead of its use. While numerical
programs contain regular access patterns that are easy to prefetch, integer programs
have very irregular access patterns that are much more difficult to predict. If the prefetches
generated by the prefetcher are not accessed, then the prefetched lines can cause cache pollution
and bandwidth contention, which can lead to reduced performance. Palacharla et
al. [56] investigated using the stream buffer instead of the secondary cache for streaming
numerical programs that have a working set larger than the cache size.
2.5 Servicing Demand Misses in Parallel
Another method to reduce the performance impact of cache misses is to service
the misses in parallel. The notion of generating and servicing multiple outstanding cache
misses in parallel is called Memory Level Parallelism (MLP) [23]. Kroft [39] proposed
lockup-free caches to allow instruction processing under a cache miss. Out-of-order execution
engines inherently improve MLP by continuing to execute instructions after a long-latency
miss. Instruction processing stops only when the instruction window becomes full.
If additional misses are encountered before the window becomes full, then these misses are
serviced in parallel with the stalling miss. The effectiveness of an out-of-order engine’s
ability to increase MLP is limited by the instruction window size.
Runahead execution [51] overcomes the limitation posed by the instruction win-
dow size. When the instruction window becomes full due to a long-latency cache miss,
a runahead execution engine removes the stalling instruction from the window and processes
instructions speculatively such that long-latency cache misses and their dependents
do not stall the window. Instruction processing continues speculatively with the sole aim
of generating additional (useful) misses to be serviced in parallel with the stalling miss.
Chou et al. [16] analyzed the effectiveness of different microarchitectural tech-
niques such as out-of-order execution, value prediction [89], and runahead execution on
increasing MLP. They concluded that microarchitecture optimizations can have a profound
impact on increasing MLP. MLP can also be improved at the compiler level. Read miss
clustering [55] is a compiler technique in which the compiler reorders load instructions
with predictable access patterns to improve memory parallelism.
Chapter 3
Hybrid Replacement via Dynamic Set Sampling
Different workloads and program phases have varying locality characteristics that
make them better suited to different replacement policies. However, traditional cache designs
decide the replacement policy at design time and use that policy for all applications
and phases. Cache performance can be improved if the cache management can select from
multiple replacement policies depending on which policy performs better for the given
application or phase. This chapter provides a practical framework to implement hybrid
replacement that can choose the best performing policy at runtime.
3.1 Motivation
Figure 3.1 compares the Misses Per Thousand Instructions (MPKI) for the baseline
1MB 16-way L2 cache for two replacement policies: Least Recently Used (LRU) and
Least Frequently Used (LFU). The details about other parameters of the experiment are
discussed in section 3.2. LFU replacement reduces MPKI by more than 10% compared
to LRU for seven out of the fifteen benchmarks. Benchmarks such as art and galgel have
less than 50% of the misses with LFU compared to LRU. However, LFU can substantially
increase misses for LRU-friendly benchmarks such as equake, parser, mgrid, and swim. For
example, LFU more than doubles the MPKI for parser and swim. Thus, neither of the two
policies, LRU and LFU, performs well across all benchmarks. A mechanism that selects the

Figure 3.1: Comparison of replacement policies: LRU and LFU
3.2 Experimental Methodology
3.2.1 Configuration
Table 3.1 shows the parameters of the baseline configuration used in our studies.
We use an in-house execution-driven simulator that models the Alpha ISA. The processor
core is 8-wide issue, out-of-order, with a 128-entry reservation station. The 128-entry store
buffer prevents the processor from stalling on store misses unless the store buffer is full.
Because our study deals with the memory system, we model the memory system in detail.
DRAM bank conflicts and bus queuing delays are modeled. The baseline L2 cache is 1MB
in size and is organized as a 16-way set-associative structure. Unless stated otherwise, all
caches use the LRU policy for replacement decisions. For experiments with LFU replacement,
the LFU policy is implemented by associating a five-bit frequency counter with each cache
line. When a cache line is installed, the frequency counter associated with that line is
initialized to 0. The frequency counter is incremented at each access to the line. When
the frequency counter of a line reaches its maximum value, the frequency counters of all
the lines in that set are halved. On a cache miss, the line that has the lowest value of the
frequency counter in the miss-causing set is identified as the victim. Ties for the lowest
value of the frequency counter are broken randomly. Unless stated otherwise, MPKI numbers
are obtained using a trace-driven cache simulator to reduce simulation time.
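The counter-based LFU policy described above can be sketched for a single set as follows. This is an illustration, not the simulator's code: the class name LFUSet and its interface are hypothetical, and the sketch assumes that the access that installs a line does not itself increment the counter.

```python
import random

COUNTER_BITS = 5
COUNTER_MAX = (1 << COUNTER_BITS) - 1  # 31 for a five-bit counter


class LFUSet:
    """One cache set managed with a counter-based LFU policy (sketch)."""

    def __init__(self, ways, rng=None):
        self.ways = ways
        self.freq = {}                       # tag -> frequency counter
        self.rng = rng or random.Random(0)   # seeded only for repeatability

    def access(self, tag):
        """Return True on a hit; on a miss, install the line and return False."""
        hit = tag in self.freq
        if hit:
            self.freq[tag] += 1
            if self.freq[tag] == COUNTER_MAX:
                # A counter reached its maximum: halve all counters in the set.
                for t in self.freq:
                    self.freq[t] //= 2
        else:
            if len(self.freq) == self.ways:
                lowest = min(self.freq.values())
                candidates = [t for t, c in self.freq.items() if c == lowest]
                victim = self.rng.choice(candidates)  # random tie-break
                del self.freq[victim]
            self.freq[tag] = 0               # new lines start at 0
        return hit
```

In a two-way set, the sequence A, A, B, C evicts B rather than A when C arrives, because B still has a counter of 0 while A has been incremented once.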
Table 3.1: Baseline system configuration
Pipeline            8 wide, out-of-order, with 128 entry reservation station.
Branch Predictor    64 kB hybrid branch predictor with 4k-entry BTB;
                    minimum branch misprediction penalty of 15 cycles.
Instruction Cache   16kB, 64B line-size, 4-way with LRU replacement, 2-cycle access.
Data Cache          16kB, 64B line-size, 4-way with LRU replacement, 2-cycle access.
Unified L2 Cache    1MB, 64B line-size, 16-way with LRU replacement;
                    15-cycle hit, 32-entry MSHR, 128-entry store buffer.
Memory              32 DRAM banks; 400-cycle access latency;
                    bank conflicts modeled; maximum 32 outstanding requests.
Bus                 16B-wide split-transaction bus at 4:1 frequency ratio;
                    queuing delays modeled.
3.2.2 Benchmarks
We use SPEC CPU2000 benchmarks compiled for the Alpha ISA with the -fast
optimizations and profiling feedback enabled. For each benchmark, a representative slice of
250M instructions was obtained with a tool we developed using the SimPoint [58] methodology.
For all benchmarks except apsi, the reference input set is used. For apsi, the
train input set is used.
Because cache replacement does not affect the number of compulsory misses, benchmarks
that have a high percentage of compulsory misses are unlikely to benefit from improvements
in cache replacement algorithms. Therefore, detailed studies are performed
only for benchmarks for which approximately 50% or fewer misses are compulsory misses.
Key results for the 11 SPEC benchmarks excluded from the detailed study will be shown
in Appendix A. Table 3.2 shows the type, the fast-forward interval (FFWD), the L2 MPKI,
and the percentage of compulsory misses for each benchmark.
Table 3.2: Benchmark summary (B = Billion)
Name      Type  FFWD      MPKI   Compulsory Misses
art       FP    18.25B    38.7   0.5%
mcf       INT   14.75B    136    1.8%
twolf     INT   30.75B    3.48   2.9%
vpr       INT   60B       2.16   4.3%
facerec   FP    111.75B   3.66   4.8%
ammp      FP    4.75B     2.83   5.0%
galgel    FP    14B       5.34   5.9%
equake    FP    26.25B    18.4   14.2%
bzip2     INT   2.25B     2.4    14.8%
parser    INT   66.25B    1.57   20.0%
sixtrack  FP    8.5B      0.42   20.7%
apsi      FP    3.25B     0.32   21.4%
lucas     FP    2.5B      16.2   41.6%
mgrid     FP    3.5B      7.73   46.6%
swim      FP    3.5B      23.0   50.4%
3.3 Hybrid Replacement
For some benchmarks LRU has fewer misses than LFU, and for some LFU has fewer
misses than LRU. We want a mechanism that can dynamically choose the replacement
policy that has the fewest misses. A straightforward method of doing this is to implement
both LFU and LRU in two additional tag directories (note that data lines are not required
to estimate the performance of replacement policies) and to keep track of which of the two
policies is doing better. The main tag directory of the cache can select the policy that is
giving the lowest number of cache misses. In fact, a similar technique of implementing
multiple policies and dynamically choosing the best performing policy is well understood
for hybrid branch predictors [49]. However, to our knowledge, no previous research has
looked at dynamic selection of replacement policy by implementing multiple replacement
schemes concurrently. Part of the reason is that the hardware overhead of implementing
two or more additional tag directories, each the same size as the tag directory of the main
cache, is expensive. To reduce this hardware overhead, we provide a novel, cost-effective
solution that makes hybrid replacement practical. We explain our selection mechanism
before describing the final cost-effective solution.
3.3.1 Tournament Selection of Replacement Policy
Let MTD be the main tag directory of the cache. For facilitating hybrid replace-
ment, MTD is capable of implementing both LFU and LRU. MTD is appended with two
Auxiliary Tag Directories (ATDs): ATD-LFU and ATD-LRU. Both ATD-LFU and ATD-
LRU have the same associativity as MTD. ATD-LFU implements only the LFU policy, and
ATD-LRU implements only the LRU policy. A saturating counter (PSEL) keeps track of
which of the two ATDs is doing better. The access stream visible to MTD is also fed to
both ATD-LFU and ATD-LRU. Both ATD-LFU and ATD-LRU compete and the output of
PSEL is an indicator of which policy is doing better. The replacement policy to be used in
MTD is chosen based on the value of PSEL. We call this mechanism Tournament Selection
(TSEL). Figure 3.2 shows the operation of the TSEL mechanism for one set in the cache.
If a given access hits or misses in both ATD-LFU and ATD-LRU, neither policy is
doing better than the other. Thus, PSEL remains unchanged. If an access misses in ATD-
LFU but hits in ATD-LRU, LRU is doing better than LFU for that access. In this case,
PSEL is decremented. Conversely, if an access misses in ATD-LRU but hits in ATD-LFU,

[Figure 3.2 selection logic: if MSB of PSEL is 1, MTD uses LFU; else MTD uses LRU.]
Figure 3.2: Tournament selection of replacement policies for a single set.
that result in a miss for MTD are serviced by the memory system. If an access results in
a hit for MTD but a miss for either ATD-LFU or ATD-LRU, then it is not serviced by the
memory system. Instead, the ATD that incurred the miss finds a replacement victim using
its replacement policy and updates the tag field associated with the replacement victim.
Unless stated otherwise, we use a 10-bit PSEL counter in our experiments. All PSEL
updates are done using saturating arithmetic.
If LFU incurs fewer misses than LRU, then PSEL will be saturated towards its
maximum value. Similarly, PSEL will be saturated towards zero if the opposite is true.
If the most significant bit (MSB) of PSEL is 1, the output of PSEL indicates that LFU is
doing better. Otherwise, the output of PSEL indicates that LRU is doing better.
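The PSEL update rule can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the class name PolicySelector is hypothetical, the counter is assumed to start at the midpoint of its range (the starting value is not specified here), and the increment on an LFU-only hit is inferred from the statement that PSEL saturates towards its maximum value when LFU incurs fewer misses.

```python
class PolicySelector:
    """Saturating PSEL counter choosing between LFU and LRU (sketch)."""

    def __init__(self, bits=10):
        self.bits = bits
        self.max_value = (1 << bits) - 1
        self.value = 1 << (bits - 1)   # assumed starting point: midpoint

    def update(self, lfu_hit, lru_hit):
        """Update PSEL from the outcome of one access in ATD-LFU and ATD-LRU."""
        if lfu_hit == lru_hit:
            return                     # both hit or both miss: no information
        if lru_hit:
            # Miss in ATD-LFU, hit in ATD-LRU: LRU did better, so decrement.
            self.value = max(0, self.value - 1)
        else:
            # Miss in ATD-LRU, hit in ATD-LFU: LFU did better, so increment.
            self.value = min(self.max_value, self.value + 1)

    def chosen_policy(self):
        """MSB of PSEL equal to 1 selects LFU; otherwise LRU is selected."""
        msb = (self.value >> (self.bits - 1)) & 1
        return "LFU" if msb else "LRU"
```

Because the updates saturate, a long run of accesses favoring one policy cannot overflow the counter, and only a comparable run favoring the other policy flips the MSB back.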
A simple method of extending the TSEL mechanism for the entire cache is to have
both ATD-LFU and ATD-LRU feed a single global PSEL counter. The output of the single
PSEL decides the policy for all the sets in MTD. We call this mechanism TSEL-global. An

[Figure 3.3 labels: ATD-LFU; for All Sets in MTD]
Figure 3.3: TSEL-global mechanism
3.3.2 Results for Tournament Selection
Figure 3.4 compares the MPKI of LRU, LFU, and the TSEL-global mechanism. In
almost all cases TSEL-global provides a similar MPKI to the better of the two policies,
LRU and LFU. For ammp, TSEL-global has better MPKI than either of the component
policies. This happens because ammp has two phases during the program execution. LFU
has fewer misses than LRU in the first phase and LRU has fewer misses than LFU in the
second phase. With TSEL-global the cache is able to get the policy better suited to each
phase, thus outperforming each of the component policies. Although we have explained
the TSEL-global mechanism with LRU and LFU as the component policies, the mechanism

Figure 3.4: Comparison of replacement policies: LRU, LFU, and Selecting between LRU
and LFU using TSEL-global.
3.4 Dynamic Set Sampling
Although TSEL-global can select between component policies, it requires two ATDs,
each sized the same as MTD, which makes TSEL-global a high-overhead option. The key
insight that allows us to reduce the number of ATD entries for TSEL-global is that it is
not necessary to have all the sets participate in deciding the output of PSEL. If only a few
sampled sets are allowed to decide the output of PSEL, then the TSEL-global mechanism
will still choose the best performing policy with a high probability. The sets that participate
in updating PSEL are called Leader Sets. Figure 3.5 shows a TSEL-global mechanism with
Dynamic Set Sampling (DSS). Sets B, E, and G are the leader sets. These sets have ATD
entries and are the only sets that update the PSEL counter. There are no ATD entries for
the remaining sets.

[Figure 3.5 labels: ATD-LRU (Set B, Set E, Set G); ATD-LFU (Set B, Set E, Set G); PSEL; for All Sets in MTD]
Figure 3.5: Reducing ATD overhead via Dynamic Set Sampling.
For the example in Figure 3.5, DSS reduces the number of ATD entries required for
the TSEL-global mechanism to 3/8 of its original value. A natural question is: how many
leader sets are sufficient to select the best performing replacement policy? We provide both
analytical as well as empirical answers to this question.
3.4.1 Analytical Model for Dynamic Set Sampling
We make the simplifying assumption that all sets affect performance equally. Let
P(Best) be the probability that the best performing policy is selected by the sampling-based
TSEL-global mechanism. Let there be N sets in the cache. Let p be the fraction of
the sets that favor the best performing policy. Given that we have two policies, LRU and
LFU, by definition p ≥ 0.5. Thus, if only one set is selected at random from the cache as
the leader set, then P(Best) = p.
If three sets (N ≫ 3) are chosen at random from the cache as leader sets, then for
the mechanism to correctly select the globally best performing policy, at least two of the
three leader sets should favor the globally best performing policy. Thus, for three leader
sets, P(Best) is given by:
P(Best) = p^{3} + 3 \cdot p^{2} \cdot (1 - p)    (3.1)
In general, if k sets (k ≪ N) are randomly selected from the cache as leader sets,

[...] \binom{k}{i} \cdot p^{(k-i)} \cdot (1 - p)^{i}    ... for even k    (3.3)
where \binom{k}{i} refers to the number of combinations of i elements from a group of k
elements, k!/(i! \cdot (k - i)!). Figure 3.6 plots P(Best) for different numbers of leader sets as
p is varied. When there is a significant difference in the performance of the two replacement
policies, the value of p is higher than 0.7. Therefore, from Figure 3.6 we can conclude that
a small number of leader sets (16-32) is sufficient to select the globally best-performing
policy with a high (> 95%) probability. This is an important result because it means that
the baseline cache can have expensive ATD entries for only 16-32 sets (i.e., about 2% to
3% of all sets) instead of all the 1024 sets in the cache.
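The analytical model can be evaluated numerically. The sketch below assumes a simple majority vote among k leader sets, each independently favoring the best policy with probability p; for three leader sets it reduces to Equation 3.1. The handling of an exact tie for even k (resolved correctly half of the time) is an assumption of this sketch, as are the function and variable names.

```python
from math import comb


def p_best(k, p):
    """Probability that a majority of k leader sets favors the best policy.

    k: number of randomly chosen leader sets (assumed much smaller than N).
    p: fraction of sets that favor the best performing policy (p >= 0.5).
    """
    total = 0.0
    # i counts the leader sets that favor the other policy; a strict
    # majority for the best policy requires i < k / 2.
    for i in range((k + 1) // 2):
        total += comb(k, i) * p ** (k - i) * (1 - p) ** i
    if k % 2 == 0:
        # Assumption of this sketch: an exact tie is resolved correctly
        # half of the time.
        i = k // 2
        total += 0.5 * comb(k, i) * p ** (k - i) * (1 - p) ** i
    return total


if __name__ == "__main__":
    for k in (1, 3, 8, 16, 32, 64):
        print(k, round(p_best(k, 0.7), 4))
```

With p = 0.5 the function returns 0.5 for every k, which is the expected behavior when neither policy is favored by more sets than the other.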

Figure 3.6: Analytical Bounds on Number of Leader Sets.
3.5 Sampling Based Adaptive Replacement
DSS makes it possible to choose the best performing policy with high probability
even with very few sets in the ATD. Because the number of leader sets is small, the hard-
ware overhead can be further reduced by embedding the functionality of one of the ATDs
in MTD. Figure 3.7 shows such a sampling-based hybrid scheme, called Sampling Based
Adaptive Replacement (SBAR). The sets in MTD are logically divided into two categories:
Leader Sets and Follower Sets. The leader sets in MTD use only the LRU policy for replacement
and participate in updating the PSEL counter. The follower sets implement both
the LFU and the LRU policies for replacement and use the PSEL output to choose their
replacement policy. The follower sets do not update the PSEL counter. There is only a
single ATD, ATD-LFU. ATD-LFU implements only the LFU policy and has only sets corresponding
to the leader sets. In Figure 3.7, Sets B, E, and G are leader sets and Sets A,
C, D, F, and H are follower sets. Thus, the SBAR mechanism requires a single ATD with

[Figure 3.7 legend for MTD. Leader Sets: have ATD-LFU entries; always follow LRU policy. Follower Sets: no ATD-LFU entries; policy decided by PSEL. Other labels: Miss in Leader Sets of MTD; for Follower Sets in MTD.]
Figure 3.7: Sampling Based Adaptive Replacement
Although the example shows the LRU policy implemented in the leader sets of MTD, in
general, either of the two policies can be implemented in the leader sets of the MTD and the
state of the other policy can be tracked using the ATD. Figure 3.5 shows a special case
of the SBAR mechanism where both policies are tracked using the ATD. An alternative
configuration implements both policies in a small dedicated number of sets
and uses the better performing policy on the remaining sets. This option is explored by
the in-cache dynamic set sampling mechanism in detail in Chapter 4.
3.5.1 Leader Set Selection Mechanism
We now discuss a method to select leader sets. Let N be the number of sets in the
cache and K be the number of leader sets (in our studies we restrict the number of leader
sets to be a power of 2). We logically divide the cache into K equally-sized regions each
containing N/K sets. We call each such region a constituency. One leader set is chosen
from each constituency, either statically at design time or dynamically at runtime. A bit
associated with each set then identifies whether the set is a leader set. We propose a leader
set selection policy that obviates the need for marking the leader set in each constituency on
a per-set basis. We call this policy the simple-static policy. It selects set 0 from constituency
0, set 1 from constituency 1, set 2 from constituency 2, and so on. For example, if K=32 and
N=1024, the simple-static policy selects sets 0, 33, 66, 99, ..., and 1023 as leader sets. For
the leader sets, bits [9:5] of the cache index are identical to the bits [4:0] of the cache index,
which means that the leader sets can easily be identified using a single five-bit comparator
without any extra storage. Unless stated otherwise, SBAR uses the simple-static policy.
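The five-bit comparison for the K=32, N=1024 example can be written out directly. The following Python sketch is only illustrative; the function name is_leader_set is hypothetical, and the bit positions are hard-coded for this one configuration.

```python
def is_leader_set(set_index):
    """Simple-static check for N=1024 sets and K=32 leader sets.

    A set is a leader when bits [9:5] of its index (the constituency
    number) equal bits [4:0] (the position within the constituency).
    """
    upper = (set_index >> 5) & 0x1F   # bits [9:5]
    lower = set_index & 0x1F          # bits [4:0]
    return upper == lower


# Enumerate the leader sets of the 1024-set cache.
leaders = [s for s in range(1024) if is_leader_set(s)]
```

The resulting list contains 32 indices, each a multiple of 33, starting at set 0 and ending at set 1023, which matches the example in the text.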
3.5.2 Hardware Cost of SBAR
The dynamic selection of SBAR comes at a small hardware overhead in terms of the
ATD entries. Table 3.3 details the storage overhead of SBAR assuming a 40-bit physical
address space and 32 leader sets. SBAR requires a storage overhead of 1920 bytes, which
is less than 0.2% of the total area of the baseline L2 cache. In addition, MTD needs storage
for implementing the two component replacement policies.
Table 3.3: Storage overhead of SBAR.
Size of each ATD entry (1 valid bit + 24-bit tag + 5-bit LFU)    30 bits
Total number of ATD entries per leader set                       16
ATD overhead per leader set (30 bits/way * 16 ways)              60 B
Total SBAR overhead (32 leader sets * 60 B/set)                  1920 B
Area of baseline L2 cache (64kB tags + 1MB data)                 1088 kB
Percentage increase in L2 area due to SBAR (1920B/1088kB)        0.18%
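The arithmetic behind Table 3.3 can be reproduced with a short Python sketch. The variable names are hypothetical. The 24-bit tag is derived here from the 40-bit physical address, the 64B line size, and the 1024 sets of the baseline L2 cache. The final ratio treats 1 kB as 1000 B, which is an assumption made to match the 0.18% shown in the table; with 1 kB = 1024 B the ratio is about 0.17%, and either way it is below 0.2%.

```python
PHYSICAL_ADDRESS_BITS = 40
LINE_BYTES = 64
L2_BYTES = 1 << 20          # 1MB data array
WAYS = 16
LEADER_SETS = 32
LFU_COUNTER_BITS = 5
VALID_BITS = 1

num_sets = L2_BYTES // (LINE_BYTES * WAYS)                    # 1024 sets
offset_bits = LINE_BYTES.bit_length() - 1                     # 6 bits
index_bits = num_sets.bit_length() - 1                        # 10 bits
tag_bits = PHYSICAL_ADDRESS_BITS - index_bits - offset_bits   # 24 bits

entry_bits = VALID_BITS + tag_bits + LFU_COUNTER_BITS         # 30 bits
bytes_per_leader_set = entry_bits * WAYS // 8                 # 60 B
total_bytes = bytes_per_leader_set * LEADER_SETS              # 1920 B

baseline_kb = 64 + 1024                                       # tags + data
# Assumption: 1 kB is taken as 1000 B to match the table's rounding.
overhead_percent = 100 * total_bytes / (baseline_kb * 1000)

print(f"{total_bytes} B, {overhead_percent:.2f}% of the baseline L2 area")
```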
3.6 Results

Figure 3.8: Comparison of TSEL-global and SBAR.
Figure 3.8 shows the MPKI of the TSEL-global mechanism and the SBAR mecha-
nism relative to the baseline cache with LRU policy. Both TSEL-global and SBAR select
between LRU and LFU. The reduction provided by both mechanisms is similar, except
that SBAR has substantially less hardware overhead than TSEL-global as it requires 64x
fewer ATD entries than TSEL-global. Thus DSS allows SBAR to implement the hybrid
replacement mechanism in a cost-effective manner.
3.6.2 Effect of Number of Leader Sets on SBAR
We use 32 leader sets for implementing SBAR. This section analyzes the sensitivity
of SBAR's performance to the number of leader sets. Figure 3.9 compares the
SBAR mechanism with 16, 32, and 64 leader sets against the TSEL-global mechanism. The
PSEL counter is scaled appropriately for SBAR, using a 9-bit PSEL for 16 leader sets and an
11-bit PSEL for 64 leader sets. For two benchmarks, ammp and apsi, there is a significant
difference between the MPKI of SBAR with 16 leader sets and TSEL-global, indicating
that 16 sets are not sufficient for SBAR to perform similarly to TSEL-global. However, with

[Figure 3.9 legend: SBAR (16 leader sets); SBAR (32 leader sets)]
Figure 3.9: Effect of Number of Leader Sets on SBAR.
3.6.3 SBAR selection between LRU and Random replacement
The proposed SBAR mechanism can be used to select between any two replacement
policies. For example, SBAR can be used to select between LRU and random (RND)
replacement by implementing random replacement in the ATD. The PSEL counter would
then be an indicator of which of the two policies, LRU and RND, is doing better. In
our experiments with random replacement, we implement RND using the GNU C rand()
function. Figure 3.10 compares the MPKI of three replacement schemes: LRU, RND, and
the SBAR-based dynamic selection between LRU and RND. RND replacement reduces
misses substantially for benchmarks art, facerec, ammp, galgel, sixtrack, and apsi while
increasing misses by more than 20% for benchmarks twolf, vpr, bzip2, and
parser. SBAR-based dynamic selection between LRU and RND has MPKI that is similar

Figure 3.10: Comparisons of LRU, Random, and SBAR (LRU+RND)
27
3.7 Summary
This chapter proposed an implementable mechanism that can adaptively select be-
tween two replacement policies, depending on which policy is providing fewer misses at a
given time during execution. Different replacement policies can perform better in different
program phases, and therefore having an adaptive hybrid replacement policy can provide
better performance than either of the constituent replacement policies.
We propose the Dynamic Set Sampling (DSS) mechanism that predicts the behav-
ior of the whole cache by sampling the behavior of only a few sets in the cache. In our
hybrid replacement scheme, DSS significantly reduces the hardware cost of predicting the
performance impact of a replacement policy. In general, DSS is a basic building block that




The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive
workloads that have a working set greater than the available cache size. For such
applications, the majority of lines traverse from the MRU position to the LRU position
without receiving any cache hits, resulting in inefficient use of cache space. Cache
performance can be improved if some fraction of the working set is retained in the cache
so that at least that fraction of the working set can contribute to cache hits.
We show that simple changes to the insertion policy can significantly reduce cache
misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP), which
places the incoming line in the LRU position instead of the MRU position. LIP protects
the cache from thrashing and results in close to optimal hit-rate for applications that have
a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an
enhancement of LIP that adapts to changes in the working set while maintaining the thrashing
protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between
BIP and the traditional LRU policy depending on which policy incurs fewer misses. The
proposed insertion policies do not require any change to the existing cache structure, are
trivial to implement, and have a storage requirement of less than two bytes. DIP reduces
the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of
the gap between LRU and OPT.
4.1 Introduction
The LRU replacement policy and its approximations have remained the de facto
standard for replacement policy in on-chip caches over the last several decades. While
the LRU policy has the advantage of good performance for high-locality workloads, it
can exhibit pathological behavior for memory-intensive workloads that have a working set
greater than the available cache size. There have been numerous proposals to improve the
performance of LRU; however, many of these proposals incur a huge storage overhead,
significant changes to existing design, and poor performance for LRU-friendly workloads.
Every added structure and change to the existing design requires design effort, verification
effort, and testing effort. Therefore, it is desirable that changes to the conventional
replacement policy require minimal changes to the existing design, require no additional
hardware structures, and perform well for a wide variety of applications. This chapter
focuses on designing a cache replacement policy that performs well for both LRU-friendly
and LRU-averse workloads while requiring negligible hardware overhead and changes.
We divide the problem of cache replacement into two parts: victim selection policy
and insertion policy. The victim selection policy decides which line gets evicted for storing
an incoming line, whereas the insertion policy decides where in the replacement list the
incoming line is placed. For example, the traditional LRU replacement policy inserts the
incoming line in the MRU position, thus using the policy of MRU Insertion. Inserting
the line in the MRU position gives the line a chance to obtain a hit while it traverses all
the way from the MRU position to the LRU position. While this may be a good strategy
for workloads whose working set is smaller than the available cache size or for workloads
that have high temporal locality, such an insertion policy causes thrashing for
memory-intensive workloads that have a working set greater than the available cache size. We show
that with the traditional LRU policy, more than 60% of the lines installed in the L2 cache
remain unused between insertion and eviction. Thus, most of the inserted lines occupy
cache space without ever contributing to cache hits. When the working set is larger than
the available cache size, cache performance can be improved by retaining some fraction
of the working set long enough that at least that fraction of the working set contributes to
cache hits. However, the traditional LRU policy offers no protection for retaining the cache
lines longer than the cache capacity.
We show that simple changes to the insertion policy can significantly improve cache
performance for memory-intensive workloads while requiring negligible hardware overhead.
We propose the LRU Insertion Policy (LIP), which places all the incoming lines in
the LRU position. These lines are promoted from the LRU position to the MRU position
only if they get referenced while in the LRU position. LIP prevents thrashing for workloads
whose working set is greater than the cache size and obtains near-optimal hit rates for
workloads that have a cyclic access pattern. LIP can easily be implemented by avoiding
the recency update at insertion.
LIP may retain the lines in the non-LRU position of the recency stack even if they
cease to contribute to cache hits. Since LIP does not have an aging mechanism, it may
not respond to changes in the working set of a given application. We propose the Bimodal
Insertion Policy (BIP), which is similar to LIP, except that BIP infrequently (with a low
probability) places the incoming line in the MRU position. We show that BIP adapts to
changes in the working set while retaining the thrashing protection advantages of LIP.
For LRU-friendly workloads that favor the traditional policy of MRU insertion,
the changes to the insertion policy are detrimental to cache performance. We propose a
Dynamic Insertion Policy (DIP) to choose between the traditional LRU policy and BIP
depending on which policy incurs fewer misses. DIP requires runtime estimates of misses
incurred by each of the competing policies. To implement DIP without requiring significant
hardware overhead, we propose In-Cache Dynamic Set Sampling (IDSS). IDSS dedicates a
few sets of the cache to each of the two competing policies and uses the policy that performs
better on the dedicated sets for the remaining follower sets. We examine both analytical and
empirical bounds for the number of dedicated sets and show that as few as 32 to 64
dedicated sets are sufficient for IDSS to choose the best policy. An implementation of DIP
using IDSS requires no extra storage other than a single saturating counter and performs
similar to LRU for LRU-friendly workloads.
Insertion policies come into effect only during cache misses; therefore, changes to
the insertion policy do not affect the access time of the cache. The proposed changes
to the insertion policy are particularly attractive as they do not require any changes to the
structure of an existing cache design, incur only a negligible amount of logic circuitry, and
have a storage overhead of less than two bytes.
4.2 Motivation
Our study is focused on reducing L2 misses by managing the L2 cache efficiently.
The access stream visible to the L2 cache has filtered temporal locality due to the hits in the
first-level cache. The loss of temporal locality causes a significant percentage of L2 cache
lines to remain unused. We refer to cache lines that are not referenced between insertion
and eviction as zero reuse lines. Figure 4.1 shows that for the baseline 1MB 16-way
LRU-managed L2 cache more than half the lines installed in the cache are never reused before
getting evicted. Thus, the traditional LRU policy results in inefficient use of cache space as
most of the lines installed occupy cache space without contributing to cache hits.
Zero reuse lines occur for two reasons. First, the line has no temporal
locality, which means that the line is never re-referenced. It is not beneficial to insert such
lines in the cache. Second, the line is re-referenced at a distance greater than the cache
size, which causes the LRU policy to evict the line before it gets reused. Several studies
Figure 4.1: Percentage of Zero Reuse Lines for the Baseline 1MB 16-way L2 cache
temporal locality. However, temporal locality exploited by the cache is a function of both
the replacement policy and the size of the working set relative to the available cache size.
For example, if a workload frequently reuses a working set of 2 MB, and the available cache
size is 1MB, then the LRU policy will cause all the installed lines to have poor temporal
locality. In such a case, bypassing or early evicting all the lines in the working set will not
improve cache performance. The optimal policy in such cases is to retain some fraction of
the working set long enough so that at least that fraction of the working set provides cache
hits. However, the traditional LRU policy offers no protection for retaining the cache lines
longer than the cache capacity.
For workloads with a working set greater than the cache size, cache performance can
be significantly improved if the cache can retain some fraction of the working set. To
achieve this, we separate the replacement policy into two parts: victim selection policy and
insertion policy. The victim selection policy decides which line gets evicted for storing an
incoming line. The insertion policy decides where in the replacement list the incoming
line is placed. We propose simple changes to the insertion policy that significantly improve
cache performance of memory-intensive workloads while requiring negligible overhead.
4.3 Static Insertion Policies
The traditional LRU replacement policy inserts all the incoming lines in the MRU
position. Inserting the line in the MRU position gives the line a chance to obtain a hit while
it traverses all the way from the MRU position to the LRU position. While this may be a
good strategy for workloads whose working set is smaller than the available cache size or
for workloads that have a high temporal locality, such an insertion policy causes thrashing
for memory-intensive workloads that have a working set greater than the available cache
size. When the working set is greater than the available cache size, cache performance can
be improved by retaining some fraction of the working set long enough that at least that
fraction of the working set results in a cache hit.
For such workloads, we propose the LRU Insertion Policy (LIP), which places all
incoming lines in the LRU position. These lines are promoted from the LRU position to the
MRU position only if they are reused while in the LRU position. LIP prevents thrashing for
workloads that reuse a working set greater than the available cache size. To our knowledge
this is the first study to investigate the insertion of demand lines in the LRU position.
Earlier studies [21] have proposed to insert prefetched lines in the LRU position to reduce
the pollution caused by inaccurate prefetching. However, they were targeting the problem
of extraneous references generated by the prefetcher while our study is targeted towards
the fundamental locality problem in memory reference streams. In their model, demand
references were still placed in the MRU position, leaving the cache vulnerable to thrashing
under LRU replacement.
LIP may retain the lines in the non-LRU position of the recency stack even if they
cease to be re-referenced. Since LIP does not have an aging mechanism, it may not respond
to changes in the working set of the given application. We propose the Bimodal Insertion
Policy (BIP), which is similar to LIP, except that it infrequently (with a low probability)
places some incoming lines into the MRU position. BIP is regulated by a parameter, the
override probability ($\epsilon$), which controls the percentage of incoming lines that are placed in the
MRU position. Both the traditional LRU policy and LIP can be viewed as special cases of BIP
with $\epsilon = 1$ and $\epsilon = 0$ respectively. In Section 4.3.1 we show that for small values of $\epsilon$, BIP
can adapt to changes in working set while retaining the thrashing protection of LIP.
4.3.1 Analysis with Cyclic Reference Model
To analyze workloads that cause thrashing with the LRU policy, we use a theoretical
model of cyclic references. A similar model has been used earlier by McFarling [48] for
modeling conflict misses in a direct-mapped instruction cache. Let $a_i$ denote the address
of a cache line. Let $(a_1 \cdots a_T)$ denote a temporal sequence of references $a_1, a_2, \ldots, a_T$. A
temporal sequence that repeats $N$ times is represented as $(a_1 \cdots a_T)^N$.
Let there be an access pattern in which $(a_1 \cdots a_T)^N$ is followed by $(b_1 \cdots b_T)^N$.
We analyze the behavior of this pattern for a fully associative cache that contains space for
storing $K$ ($K < T$) lines. We assume that the parameter $\epsilon$ in BIP is small, and that both
sequences in the access pattern repeat many times ($N \gg T$ and $N \gg K/\epsilon$). Table 4.1
compares the hit-rate of LRU, OPT, LIP, and BIP for this access pattern.
Table 4.1: Hit Rate for LRU, OPT, LIP, and BIP under Cyclic Reference Model
Policy   $(a_1 \cdots a_T)^N$                              $(b_1 \cdots b_T)^N$
LRU      $0$                                               $0$
OPT      $(K-1)/(T-1)$                                     $(K-1)/(T-1)$
LIP      $(K-1)/T$                                         $0$
BIP      $(K-1-\epsilon\cdot[T-K])/T \approx (K-1)/T$      $\approx (K-1-\epsilon\cdot[T-K])/T \approx (K-1)/T$
As the cache size is less than $T$, LRU causes thrashing and results in zero hits for
both sequences. The optimal policy is to retain any $K$ out of the $T$ lines of the cyclic
reference so that those $K$ lines receive hits. For both sequences, OPT obtains a hit rate of
$(K-1)/(T-1)$ [72]. After the cache is warmed up, LIP evicts the most recently installed
line and achieves a hit-rate of $(K-1)/T$ for the first sequence. However, LIP never allows
any element of the second sequence to enter the non-LRU position of the cache, thus
causing zero hits for the second sequence.
In each iteration, BIP inserts approximately $\epsilon \cdot (T-K)$ lines in the MRU position,
which means a hit-rate of $(K-1-\epsilon\cdot[T-K])/T$. As the value of $\epsilon$ is small, BIP obtains a
hit-rate of approximately $(K-1)/T$, which is similar to the hit-rate of LIP. However, BIP
probabilistically allows the lines of any sequence to enter the MRU position. Therefore,
when the sequence changes from the first to the second, all the lines in the cache belong
to the second sequence after $K/\epsilon$ misses. For large $N$, the transition time from the first
sequence to the second sequence is small, and the hit-rate of BIP is approximately equal
to $(K-1)/T$. Thus, for small values of $\epsilon$, BIP can respond to changes in the working set
while retaining the thrashing protection of LIP.
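The cyclic reference model can be checked with a small simulation. The following is a minimal sketch, not the simulator used in this study: it models a fully associative cache as a recency stack and treats LRU and LIP as BIP with an override probability of 1 and 0 respectively. The function names, the values of K, T, and N, and the fixed random seed are illustrative assumptions.

```python
import random


def simulate(trace, capacity, eps, seed=0):
    """Return one hit/miss flag per access for a fully associative cache.

    eps is the probability that a missing line is inserted at the MRU
    position: eps=1 behaves as LRU, eps=0 as LIP, a small eps as BIP.
    """
    rng = random.Random(seed)       # fixed seed (assumption) for repeatability
    stack = []                      # stack[0] is MRU, stack[-1] is LRU
    hits = []
    for addr in trace:
        if addr in stack:           # hit: promote the line to MRU
            stack.remove(addr)
            stack.insert(0, addr)
            hits.append(True)
            continue
        hits.append(False)
        if len(stack) == capacity:  # victim selection: always evict LRU
            stack.pop()
        if rng.random() < eps:      # insertion policy: MRU with probability eps
            stack.insert(0, addr)
        else:                       # otherwise insert at the LRU position
            stack.append(addr)
    return hits


def hit_rates(eps, K=16, T=32, N=200):
    """Hit rates for (a_1..a_T)^N followed by (b_1..b_T)^N with K cache lines."""
    first = [("a", i) for _ in range(N) for i in range(T)]
    second = [("b", i) for _ in range(N) for i in range(T)]
    hits = simulate(first + second, K, eps)
    half = len(first)
    return sum(hits[:half]) / half, sum(hits[half:]) / half


if __name__ == "__main__":
    for name, eps in [("LRU", 1.0), ("LIP", 0.0), ("BIP", 1 / 32)]:
        a_rate, b_rate = hit_rates(eps)
        print(f"{name}: first sequence {a_rate:.3f}, second sequence {b_rate:.3f}")
```

Under these assumptions, LRU gets no hits on either sequence, LIP approaches $(K-1)/T$ on the first sequence and gets no hits on the second, and BIP stays close to $(K-1)/T$ on both once the transition is over.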
4.3.2 Case Studies of Memory-Intensive Thrashing Workloads
In addition to the SPEC benchmarks, we also used the health benchmark from the
Olden suite for evaluations in this chapter. The working set of the health benchmark
increases with time, which results in thrashing with the LRU policy for the later parts of
program execution. We ran the health benchmark to completion. The MPKI for the
baseline L2 cache for health is 61.7 with 0.7% of the misses as compulsory misses.
We analyze LIP and BIP in detail using three memory-intensive benchmarks: mcf,
art, and health. These benchmarks incur the highest MPKI for the SPEC INT, SPEC FP, and
Olden benchmark suites respectively. The LRU policy results in thrashing as the working set
of these benchmarks is greater than the baseline 1MB cache. For all experiments in this
section a value of $\epsilon = 1/32$ is used.
4.3.2.1 The mcf benchmark:
    if (tail->time + arcin->org_cost > latest)
    {
        arcin = (arc_t *) tail->mark;
        continue;
    }
    arcin = (arc_t *) tail->mark;
}
Causes 84% of all L2 misses
Figure 4.2: Miss-causing instructions from the mcf benchmark
Figure 4.2 shows the code structure from the implicit.c file of the mcf benchmark
with the three load instructions that are responsible for 84% of the total L2 misses
for the baseline cache. The kernel of mcf can be approximated as linked-list traversals of
a data structure whose size is approximately 3.5MB. Figure 4.3 shows the MPKI for mcf
when the cache size is varied under the LRU policy. The MPKI reduces only marginally
till 3.5MB and then the first "knee" of the MPKI curve occurs. LRU results in thrashing
for the baseline 1MB cache and almost all the inserted lines are evicted before they can be
reused. Both LIP and BIP retain around 1MB out of the 3.5MB working set resulting in
hits for at least that fraction of the working set. For the baseline 1MB cache, LRU incurs
an MPKI of 136, both LIP and BIP incur an MPKI of 115 (17% reduction over LRU), and
OPT incurs an MPKI of 101 (26% reduction over LRU). Thus, both LIP and BIP bridge
two-thirds of the gap between LRU and OPT without requiring extra storage.
Figure 4.3: MPKI vs. cache size for mcf
4.3.2.2 The art benchmark:
Figure 4.4 shows the code snippet from the scanner.c file of the art benchmark
containing the two load instructions that are responsible for 80% of all the misses for the
baseline cache. The first load instruction traverses an array of type f1_layer. The class
of f1_layer defines it as a neuron containing seven elements of type double and one
element of type pointer to double. Thus, the size of each object of type f1_layer
is 64B. For the ref-1 input set, numf1s=10000; therefore, the total size of the array of
f1_layer is 64B * 10K = 640KB. The second load instruction traverses a two-dimensional
array of type double. The total size of this array is 8B * 11 * 10K = 880KB.
Thus, the size of the working set of the kernel is approximately 1.5MB.
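The working-set estimate for art can be reproduced with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, the element counts come from the 100*100 and 10+1 comments in Figure 4.4, and sizes use decimal units (1KB = 1000B) to match the figures quoted in the text.

```python
F1_LAYER_BYTES = 64        # per-object size of f1_layer stated in the text
DOUBLE_BYTES = 8           # size of one double
NUMF1S = 100 * 100         # lwidth * lheight for the ref input set
SECOND_DIM = 10 + 1        # second dimension of the 2-D array (hypothetical name)

f1_layer_array_bytes = F1_LAYER_BYTES * NUMF1S             # first load's array
two_d_array_bytes = DOUBLE_BYTES * SECOND_DIM * NUMF1S     # second load's array
working_set_bytes = f1_layer_array_bytes + two_d_array_bytes

print(f1_layer_array_bytes / 1000, "KB")
print(two_d_array_bytes / 1000, "KB")
print(working_set_bytes / 1e6, "MB")
```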
Figure 4.5 shows the MPKI of art for varying cache size under LRU replacement.
LRU is oblivious to the "knee" around 1.5MB and causes thrashing for the baseline 1MB
cache. Both LIP and BIP prevent thrashing by retaining a significant fraction of the working






numf1s = lwidth*lheight;    // = 100*100 for ref input set
                            // = 10+1 for ref input set
Y[tj].y = 0;
if( !Y[tj].reset )
for (ti=0;ti<numf1s;ti++)
    Y[tj].y += f1_layer[ti].P * bus[ti][tj];
Causes 41% of all L2 misses
Causes 39% of all L2 misses
Figure 4.4: Miss-causing instructions from the art benchmark
Figure 4.5: MPKI vs. cache size for art
MPKI of 23.6 (39% reduction over LRU), BIP incurs an MPKI of 18 (54% reduction over
LRU), and OPT incurs an MPKI of 12.8 (67% reduction over LRU). Both LIP and BIP are
closer to OPT. The adaptation in BIP results in much lower MPKI with BIP than with LIP.
4.3.2.3 The health benchmark:
while (list != NULL) {





Causes 71% of all L2 misses
Figure 4.6: Miss-causing instruction from the health benchmark
Figure 4.6 shows a code snippet from the health.c file. It contains the pointer
de-referencing load instruction that is responsible for more than 70% of the misses for
the baseline cache. The health benchmark can be approximated as a micro kernel that
performs linked-list traversals with frequent insertions and deletions. The size of the linked-list
data structure increases dynamically with program execution. Thus, the memory reference
stream can be approximated as a cyclic reference sequence for which the period increases
with time. To show the dynamic change in the size of the working set, we split the benchmark
execution into four parts (of approximately 50M instructions each). Figure 4.7 shows
the MPKI of each of the four phases of execution of health as the cache size is varied under
the LRU policy. During the first phase, the size of the working set is less than the baseline
1MB cache so the LRU policy works well. However, in the other three phases, the size of
the working set is greater than 1MB, which causes thrashing with LRU. For the full execution
of health, LRU incurs an MPKI of 61.7, LIP incurs an MPKI of 38 (38.5% reduction
over LRU), BIP incurs an MPKI of 39.5 (36% reduction over LRU), and OPT incurs an
MPKI of 34 (45% reduction over LRU).
Figure 4.7: MPKI vs. cache size for health
4.3.3 Case Study of a Memory-Intensive LRU-Friendly Workload
For workloads that cause thrashing with LRU, both LIP and BIP reduce cache
misses significantly. However, some workloads inherently favor the traditional policy of
inserting the incoming line at the MRU position. In such cases, changing the insertion
policy can hurt cache performance. An example of such a workload is the swim benchmark
from the SPEC FP suite. Swim performs matrix multiplies in its kernel. The first "knee"
of the matrix multiplication occurs at 1/2 MB while the second "knee" occurs at a cache size
greater than 64 MB. Figure 4.8 shows the MPKI for swim as the cache size is increased
from 1/8 MB to 64 MB under LRU replacement. There is a huge reduction in MPKI as the
cache size is increased from 1/8 MB to 1/2 MB. However, subsequent increase in cache size till
64 MB does not have a significant impact on MPKI. For the baseline cache, the MPKI with
both LRU and OPT are similar, indicating that there is no scope for reducing misses over
the LRU policy. In fact, changes to the insertion policy can only reduce the hits obtained
from the middle of the LRU stack for the baseline 1 MB cache. Therefore, both LIP and
BIP increase MPKI significantly over the LRU policy. For the baseline cache, LRU incurs
an MPKI of 23, LIP incurs an MPKI of 46.5, BIP incurs an MPKI of 44.3, and OPT incurs
an MPKI of 22.8.
Figure 4.8: MPKI vs. cache size for swim (Note: horizontal axis is in log scale)
4.3.4 Results
Figure 4.9 shows the reduction in MPKI with the two proposed insertion policies,
LIP and BIP, over the baseline LRU replacement policy. For BIP, we show results for
$\epsilon = 1/64$, $\epsilon = 1/32$, and $\epsilon = 1/16$, which mean every 64th, 32nd, or 16th miss is inserted
in the MRU position respectively (see footnote 1).
Footnote 1: In our studies, we restrict the value of $\epsilon$ to 1/power-of-two. To implement BIP, a pseudo-random number
generator is required. If there is no pseudo-random number available then an n-bit free running counter can
be used to implement a 1-out-of-$2^n$ policy ($n = \log_2(1/\epsilon)$). The n-bit counter is incremented on every cache
miss. BIP inserts the incoming line in the MRU position only if the value of this n-bit counter is zero. We
Figure 4.9: Comparison of Static Insertion Policies
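The free-running counter variant of BIP described in footnote 1 can be sketched as follows. This is an illustrative model rather than the hardware used in this study: the class and method names are hypothetical, and the choice to increment the counter before testing it for zero is an assumption (testing first would give the same 1-out-of-$2^n$ rate).

```python
class CounterBIP:
    """Decide MRU vs. LRU insertion with an n-bit free-running miss counter."""

    def __init__(self, n_bits):
        self.mask = (1 << n_bits) - 1   # counter wraps modulo 2**n_bits
        self.counter = 0

    def insert_at_mru(self):
        """Call once per cache miss; True means insert at the MRU position."""
        self.counter = (self.counter + 1) & self.mask
        return self.counter == 0


bip = CounterBIP(n_bits=5)              # n = log2(1/epsilon) for epsilon = 1/32
decisions = [bip.insert_at_mru() for _ in range(128)]
mru_inserts = sum(decisions)            # number of MRU insertions in 128 misses
```

With a 5-bit counter, exactly one out of every 32 consecutive misses is inserted at the MRU position, and all others go to the LRU position.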
The thrashing protection of LIP and BIP reduces MPKI by 10% or more for nine
out of the sixteen benchmarks. BIP has better MPKI reduction than LIP for art and ammp
because it can adapt to changes in the working set of the application. For most applications
that benefit from BIP, the amount of benefit is not sensitive to the value of $\epsilon$. For
benchmarks equake, parser, bzip2 and swim both LIP and BIP increase the MPKI considerably.
This occurs because these workloads either have an LRU-friendly access pattern,
or the knee of the MPKI curve is less than the cache size and there is no significant benefit
from increasing the cache size. For the insertion policy to be useful for a wide variety of
workloads, we need a mechanism that can select between the traditional LRU policy and
BIP depending on which incurs fewer misses. The next section describes a cost-effective
run-time mechanism to choose between LRU and BIP. For the remainder of the chapter we
use a value of $\epsilon = 1/32$ for all experiments with BIP.
4.4 Dynamic Insertion Policy
For some applications BIP has fewer misses than LRU and for some LRU has fewer
misses than BIP. We want a mechanism that can choose the insertion policy that has the
fewest misses for the application. We propose a mechanism that dynamically estimates the
number of misses incurred by the two competing insertion policies and selects the policy
that incurs the fewest misses. We call this mechanism Dynamic Insertion Policy (DIP).
A straightforward method of implementing DIP is to implement both LRU and BIP in
two extra tag directories (data lines are not required to estimate the misses incurred by an
insertion policy) and keep track of which of the two policies is doing better. The main
tag directory of the cache can then use the policy that incurs the fewest misses. Since this
implementation of DIP gathers information globally for all the sets, and enforces a uniform
policy for all the sets, we call it DIP-Global.
4.4.1 The DIP-Global Mechanism
Figure 4.10(a) demonstrates the working of DIP-Global for a cache containing sixteen
sets. Let MTD be the main tag directory of the cache. The two competing policies,
LRU and BIP, are each implemented in a separate Auxiliary Tag Directory (ATD).
ATD-LRU uses the traditional LRU policy and ATD-BIP uses BIP. Both ATD-LRU and
ATD-BIP have the same associativity as the MTD. The access stream visible to MTD is also
applied to both ATD-LRU and ATD-BIP. A saturating counter, which we call Policy Selector
(PSEL), keeps track of which of the two ATDs incurs fewer misses. All operations on
PSEL are done using saturating arithmetic. A miss in ATD-LRU increments PSEL and a
miss in ATD-BIP decrements PSEL. The Most Significant Bit (MSB) of PSEL is thus an
indicator of which of the two policies incurs fewer misses. If MSB of PSEL is 1, MTD
Figure 4.10: Implementations of Dynamic Insertion Policy: (a) DIP-Global (b) DIP-IDSS
4.4.2 The DIP-IDSS Mechanism
The DIP-Global mechanism requires a substantial hardware overhead of two extra
tag directories. The hardware overhead of comparing two policies can be significantly
reduced by using Dynamic Set Sampling (DSS). The key insight in DSS is that the cache
behavior can be approximated with a high probability by sampling a few sets in the cache.
Thus, DSS can significantly reduce the number of ATD entries in DIP-Global from
thousands of sets to about 32 sets.
Although DSS significantly reduces the storage required in implementing the ATD
(to around 2kB), it still requires building the separate ATD structure. Thus, implementing
DIP will still incur the design, verification, and testing overhead of building the separate
ATD structure. We propose In-cache Dynamic Set Sampling (IDSS), which obviates the
need for a separate ATD structure. IDSS dedicates a few sets of the cache to each of the two
competing policies. The policy that incurs fewer misses on the dedicated sets is used for
the remaining follower sets. An implementation of DIP that uses IDSS is called DIP-IDSS.
Figure 4.10(b) demonstrates the working of DIP-IDSS on a cache containing sixteen
sets. Sets 0, 5, 10, and 15 are dedicated to the LRU policy, and Sets 3, 6, 9, and 12
are dedicated to the BIP policy. The remaining sets are follower sets. A miss incurred in
the sets dedicated to LRU increments PSEL, whereas a miss incurred in the sets dedicated
to BIP decrements PSEL. If the MSB of PSEL is 0, the follower sets use the LRU policy;
otherwise the follower sets use BIP. Note that IDSS does not require any separate storage
structure other than a single saturating counter.
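The sixteen-set example above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluated hardware: the class and function names are hypothetical, the PSEL width is a parameter, and the counter is assumed to start just below its midpoint so that the follower sets initially use LRU.

```python
LRU_SETS = frozenset({0, 5, 10, 15})    # sets dedicated to LRU in the example
BIP_SETS = frozenset({3, 6, 9, 12})     # sets dedicated to BIP in the example


class Psel:
    """Saturating policy-selector counter."""

    def __init__(self, bits):
        self.bits = bits
        self.maximum = (1 << bits) - 1
        self.value = (1 << (bits - 1)) - 1   # assumed initial value, MSB = 0

    def msb(self):
        return self.value >> (self.bits - 1)

    def record_miss(self, set_index):
        """Update PSEL on a miss; misses in follower sets leave it unchanged."""
        if set_index in LRU_SETS:
            self.value = min(self.value + 1, self.maximum)
        elif set_index in BIP_SETS:
            self.value = max(self.value - 1, 0)


def insertion_policy(psel, set_index):
    """Dedicated sets keep their own policy; follower sets obey the PSEL MSB."""
    if set_index in LRU_SETS:
        return "LRU"
    if set_index in BIP_SETS:
        return "BIP"
    return "BIP" if psel.msb() else "LRU"
```

For example, with `psel = Psel(bits=4)`, a single miss in set 0 moves the counter across the midpoint, after which `insertion_policy(psel, 1)` returns `"BIP"` for the follower set 1.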
DIP-IDSS compares the number of misses across different sets for two competing
policies. However, the number of misses incurred by even a single policy varies across
different sets in the cache. A natural question is how the per-set variation in misses of
the component policies affects the dynamic selection of IDSS. Also, how many dedicated
sets are required for DIP-IDSS to approximate DIP-Global with a high probability? In
Section 4.4.3, we derive analytical bounds (see footnote 2) for DIP-IDSS as a function of both the
number of dedicated sets and the per-set variation in misses of the component policies. In
Section 4.4.5 we compare the misses incurred by DIP-IDSS and DIP-Global.
4.4.3 Analytical Model for IDSS
Let there be $N$ sets in the cache. Let IDSS be used to choose between two policies
P1 and P2. When policy P1 is implemented on all the sets in the cache, the average
number of misses per set is $\mu_1$ with standard deviation $\sigma_1$. Similarly, when policy P2 is
implemented on all the sets in the cache, the average number of misses per set is $\mu_2$ with
standard deviation $\sigma_2$. Let $\Delta$ denote the difference in average misses $|\mu_1 - \mu_2|$ and $\sigma$ denote
the combined standard deviation $\sqrt{\sigma_1^2 + \sigma_2^2}$.
Footnote 2: Chapter 3 used a Bernoulli model to derive the bounds for DSS. However, because there was a separate
ATD, the model was comparing the two policies by implementing both policies for a few sampled sets. That
analytical model does not consider the per-set variation in misses incurred by the component policies.
However, in the present case, IDSS compares the component policies by implementing them on different sets in
the cache. Therefore, the analytical model of IDSS must take into account the per-set variation in misses
incurred by the component policies, and the bounds derived in Chapter 3 are not applicable to IDSS.
Let $n$ sets be randomly selected from the cache to estimate the misses with policy
P1 and another group of $n$ sets be randomly selected to estimate the misses with policy
P2. We assume that the number of dedicated sets $n$ is sufficiently large such that by the
central limit theorem [68] the sampling distribution can be approximated as a Gaussian
distribution. We also assume that $n$ is sufficiently small compared to the total number of
sets in the cache ($N$) so that removing the $n$ sets does not significantly change the mean
and standard deviation of the remaining $(N - n)$ sets. To derive the bounds for IDSS we
use the following well-established results [68] for sampling distribution: If the distributions
of two independent random variables have the means $\mu_a$ and $\mu_b$ and the standard deviations
$\sigma_a$ and $\sigma_b$, then the distribution of their sum (or difference) has the mean $\mu_a + \mu_b$ (or $\mu_a - \mu_b$)
and the standard deviation $\sqrt{\sigma_a^2 + \sigma_b^2}$.
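Let $sum_1$ be the total number of misses for the $n$ sets dedicated to policy P1. Then,
by the central limit theorem, $sum_1$ can be approximated as a Gaussian random variable with
mean $\mu_{sum_1}$ and standard deviation $\sigma_{sum_1}$, given by:
$$\mu_{sum_1} = n \cdot \mu_1 \quad (4.1)$$
$$\sigma_{sum_1} = \sqrt{n} \cdot \sigma_1 \quad (4.2)$$
Similarly, let $sum_2$ be the total number of misses for the $n$ sets dedicated to policy
P2. Then, $sum_2$ can also be approximated as a Gaussian random variable with mean $\mu_{sum_2}$
and standard deviation $\sigma_{sum_2}$, given by:
$$\mu_{sum_2} = n \cdot \mu_2 \quad (4.3)$$
$$\sigma_{sum_2} = \sqrt{n} \cdot \sigma_2 \quad (4.4)$$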
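PSEL tracks the difference in $sum_1$ and $sum_2$ and selects the policy that has fewer
misses on the sampled sets. Let $\theta$ denote the difference in the value of the two sums, i.e.
$\theta = sum_1 - sum_2$. Because $sum_1$ and $sum_2$ are Gaussian random variables, $\theta$ is also a
Gaussian random variable with mean $\mu_\theta$ and standard deviation $\sigma_\theta$ given by:
$$\mu_\theta = \mu_{sum_1} - \mu_{sum_2} = n \cdot (\mu_1 - \mu_2) \quad (4.5)$$
$$\sigma_\theta = \sqrt{\sigma_{sum_1}^2 + \sigma_{sum_2}^2} = \sqrt{n \cdot (\sigma_1^2 + \sigma_2^2)} = \sqrt{n} \cdot \sigma \quad (4.6)$$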
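Let policy P2 have fewer misses than policy P1, i.e. $\mu_1 > \mu_2$. Then, for IDSS to
select the best policy, $\theta > 0$. If $P(Best)$ is the probability that IDSS selects the best policy,
then $P(Best)$ can be written as:
$$P(Best) = P(\theta > 0) = P\left(\frac{\theta - \mu_\theta}{\sigma_\theta} > \frac{-\mu_\theta}{\sigma_\theta}\right) \quad (4.7)$$
$$P(Best) = P\left(Z > \frac{-n \cdot (\mu_1 - \mu_2)}{\sqrt{n} \cdot \sigma}\right), \text{ where } Z = \frac{\theta - \mu_\theta}{\sigma_\theta} \text{ is the standard Gaussian variable} \quad (4.8)$$
$$P(Best) = 1 - P\left(Z > \frac{n \cdot (\mu_1 - \mu_2)}{\sqrt{n} \cdot \sigma}\right), \text{ as } P(Z > -z) = 1 - P(Z > z) \quad (4.9)$$
$$P(Best) = 1 - P\left(Z > \frac{\sqrt{n} \cdot \Delta}{\sigma}\right), \text{ where } \Delta = |\mu_1 - \mu_2|, \text{ given } \mu_1 > \mu_2 \quad (4.10)$$
$$P(Best) = 1 - P(Z > \sqrt{n} \cdot r), \text{ where } r = \frac{\Delta}{\sigma} \quad (4.11)$$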
Figure 4.11: P(Best) from Gaussian Curve
$Z$ is the standard Gaussian variable, for which the value of $P(Z > z)$ can be obtained
using standard statistical tables (see Figure 4.11). Equation 4.11 can be used to compute
$P(Best)$ for any two policies. For example, let $\mu_1 = 100$ and $\sigma_1 = 12$ for policy P1, and
$\mu_2 = 94$ and $\sigma_2 = 16$ for policy P2. Then $\Delta = 6$, $\sigma = 20$, and $r = 0.3$. For $n = 32$,
$P(Best) = 1 - P(Z > \sqrt{32} \cdot 0.3) = 1 - P(Z > 1.7) = 96\%$.
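The worked example can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are illustrative, and the complementary error function from Python's standard library stands in for the statistical tables.

```python
import math


def upper_tail(z):
    """Return P(Z > z) for a standard Gaussian variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_best(n, mu1, sigma1, mu2, sigma2):
    """Probability that sampling n sets per policy picks the better policy.

    Follows P(Best) = 1 - P(Z > sqrt(n) * r), with r = delta / sigma,
    delta = |mu1 - mu2| and sigma = sqrt(sigma1^2 + sigma2^2).
    """
    delta = abs(mu1 - mu2)
    sigma = math.sqrt(sigma1 ** 2 + sigma2 ** 2)
    r = delta / sigma
    return 1.0 - upper_tail(math.sqrt(n) * r)


if __name__ == "__main__":
    # Example from the text: mu1 = 100, sigma1 = 12, mu2 = 94, sigma2 = 16.
    for n in (8, 16, 32, 64):
        print(n, round(p_best(n, 100, 12, 94, 16), 3))
```

Sweeping n for a fixed r in this way gives one curve of the kind plotted for the analytical bounds.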
Figure 4.12 shows the variation in P(Best) as the number of dedicated sets is changed
for different values of the r metric. The r metric is a function of workload, cache
organization, and the relative difference between the two policies. For most of the benchmarks
studied, the r metric for the two policies LRU and BIP is more than 0.2, indicating that
32-64 sampled sets are sufficient for IDSS to select the best policy with a high probability.
Thus, DIP-IDSS can be implemented by dedicating about 32 to 64 sets to each of the
two policies, LRU and BIP, and using the winning policy (of the dedicated sets) for the
remaining sets.
Figure 4.12: Analytical Bounds for IDSS (P(Best) as the number of dedicated sets varies from 0 to 64, for r = 0.2, 0.3, 0.4, and 0.5)
4.4.4 Dedicated Set Selection Policy
The dedicated sets for each of the competing policies can be selected statically at
design time or dynamically at runtime. In this section we describe our method of selecting
the dedicated sets. Let N be the number of sets in the cache and K be the number of sets
dedicated to each policy (in our studies we restrict the number of dedicated sets to powers of
2). We logically divide the cache into K equally-sized regions, each containing N/K sets.
Each such region is called a constituency [63]. One set is dedicated from each constituency
to each of the competing policies. Two bits associated with each set can then identify the
set as either a follower set or a dedicated set to one of the two competing policies.
We employ a dedicated set selection policy that obviates the need for marking the
leader set in each constituency on a per-set basis. We call this policy the complement-select
policy. For a cache with N sets, the set index consists of log2(N) bits, out of which
the most significant log2(K) bits identify the constituency and the remaining log2(N/K)
bits identify the offset from the first set in the constituency. The complement-select policy
dedicates to LRU all the sets for which the constituency identifying bits are equal to the
offset bits. Similarly, it dedicates to BIP all the sets for which the complement of the offset
equals the constituency identifying bits. Thus, for the baseline cache with 1024 sets, if 32
sets are to be dedicated to both LRU and BIP, then complement-select dedicates Set 0 and
every 33rd set to LRU, and Set 31 and every 31st set to BIP. The sets dedicated to LRU
can be identified using a five-bit comparator that compares bits [4:0] to bits [9:5] of the
set index. Similarly, the sets dedicated to BIP can be identified using another five-bit comparator
that compares the complement of bits [4:0] of the set index to bits [9:5] of the set index.
Unless stated otherwise, the default implementation of DIP is DIP-IDSS with 32 dedicated
sets using the complement-select policy³ and a 10-bit⁴ PSEL counter.
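The complement-select rule can be sketched in a few lines. This is an illustrative sketch only: the function name and return labels are hypothetical, and it assumes the constituency field and the offset field have the same width, as in the baseline configuration of 1024 sets with 32 dedicated sets per policy.

```python
def complement_select(set_index, num_sets=1024, num_dedicated=32):
    """Classify a set as dedicated to LRU, dedicated to BIP, or a follower.

    Assumes num_sets and num_dedicated are powers of two and that the
    constituency bits and the offset bits have equal width.
    """
    constituency_bits = num_dedicated.bit_length() - 1        # log2(K)
    offset_bits = (num_sets // num_dedicated).bit_length() - 1  # log2(N/K)
    if constituency_bits != offset_bits:
        raise ValueError("sketch assumes equal-width constituency and offset fields")

    offset_mask = (1 << offset_bits) - 1
    constituency = set_index >> offset_bits   # most significant bits, e.g. [9:5]
    offset = set_index & offset_mask          # least significant bits, e.g. [4:0]

    if constituency == offset:
        return "LRU"
    if constituency == (~offset & offset_mask):
        return "BIP"
    return "follower"


if __name__ == "__main__":
    lru_sets = [s for s in range(1024) if complement_select(s) == "LRU"]
    bip_sets = [s for s in range(1024) if complement_select(s) == "BIP"]
    print(lru_sets[:4], bip_sets[:4])
```

Because the classification is a pure function of the set index, no per-set marking bits are needed, which is the property the text attributes to complement-select.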
4.4.5 Results
Figure 4.13 shows the reduction in MPKI with BIP, DIP-Global, and DIP-IDSS with
32 and 64 dedicated sets. The bar labeled amean is the reduction in arithmetic mean MPKI
measured over all sixteen benchmarks. DIP-Global retains the MPKI reduction of BIP
while eliminating the significant MPKI increase of BIP on benchmarks equake, parser,
mgrid, and swim. With DIP-Global, no benchmark incurs an MPKI increase of more than
2% over LRU. However, DIP-Global requires a significant hardware overhead of about
128kB. DIP-IDSS obviates this hardware overhead while obtaining an MPKI reduction
similar to DIP-Global for all benchmarks except twolf. As the number of dedicated sets
increases from 32 to 64, the probability of selecting the best policy increases; therefore,
DIP-IDSS with 64 dedicated sets behaves similarly to DIP-Global for twolf. However, having a
large number of dedicated sets also means that a higher fraction (n/N) of sets always use
BIP, even if BIP increases MPKI. This causes the MPKI of swim to increase by 5% with 64
dedicated sets. For ammp, DIP reduces MPKI by 20% even though BIP increases MPKI.
This happens because in one phase LRU has fewer misses and in the other phase BIP has
fewer misses. With DIP, the cache uses the policy best suited to each phase and hence achieves a
better MPKI than either of the component policies. We discuss the dynamic adaptation of
DIP in more detail in Section 4.4.6. On average, DIP-Global reduces average MPKI by
22.3%, DIP-IDSS (with 32 dedicated sets) reduces average MPKI by 21.3%, and DIP-IDSS
(with 64 dedicated sets) reduces average MPKI by 20.3%.
³ We also experimented with a rand-dynamic policy, which randomly dedicates one set from each
constituency to each of the two policies LRU and BIP. We invoke rand-dynamic once every 5M retired
instructions. The MPKI with rand-dynamic and with complement-select is similar. However, rand-dynamic incurs
the hardware overhead of bits for identifying the dedicated sets, which are not required for complement-select.
Figure 4.13: Comparison of Dynamic Insertion Policies
4.4.6 Dynamic Adaptation of DIP to Application Behavior
DIP can adapt to different applications as well as different phases of the same
application. DIP uses the PSEL counter to select between the component policies. For a 10-bit
PSEL counter, a value of 512 or more indicates that DIP uses BIP; otherwise DIP uses
LRU. Figure 4.14 shows the value of the 10-bit PSEL counter over the course of execution
for the benchmarks mcf, art, health, swim, and ammp. We sample the value of the PSEL
counter once every 1M instructions. The horizontal axis denotes the number of instructions
retired (in millions) and the vertical axis represents the value of the PSEL counter.
Figure 4.14: PSEL value during benchmark execution (horizontal axis denotes the number
of instructions in millions)
For mcf and art, the DIP mechanism almost always uses BIP. For health, the working
set during the initial part of the program execution fits in the baseline cache and either
policy works well. However, as the dataset increases during program execution, it exceeds
the size of the baseline cache and LRU causes thrashing. As BIP would have fewer misses
than LRU, the PSEL value reaches toward positive saturation and DIP selects BIP. For the
LRU-friendly benchmark swim, the PSEL value is almost always towards negative
saturation, so DIP selects LRU. Ammp has two phases of execution: in the first phase LRU
is better and in the second phase BIP is better. With DIP, the policy best suited to each
phase is selected; therefore, DIP has better MPKI than either of the component policies
standalone.
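The selection behavior described in this section can be sketched as a saturating counter. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the update direction (misses in LRU-dedicated sets move the counter up, misses in BIP-dedicated sets move it down) is inferred from the statement that PSEL moves toward positive saturation when BIP has fewer misses, and the midpoint initial value is an assumption.

```python
class PselCounter:
    """Saturating policy-selection counter (illustrative sketch)."""

    def __init__(self, bits=10):
        self.max_value = (1 << bits) - 1   # 1023 for a 10-bit counter
        self.threshold = 1 << (bits - 1)   # 512 for a 10-bit counter
        self.value = self.threshold        # assumption: start at the midpoint

    def miss_in_lru_dedicated_set(self):
        # A miss under LRU is evidence in favor of BIP: count up, saturating.
        self.value = min(self.value + 1, self.max_value)

    def miss_in_bip_dedicated_set(self):
        # A miss under BIP is evidence in favor of LRU: count down, saturating.
        self.value = max(self.value - 1, 0)

    def follower_policy(self):
        # A value of 512 or more selects BIP for the follower sets.
        return "BIP" if self.value >= self.threshold else "LRU"


if __name__ == "__main__":
    psel = PselCounter()
    for _ in range(600):
        psel.miss_in_lru_dedicated_set()
    print(psel.value, psel.follower_policy())
```

A program phase in which one policy misses more often steadily drags the counter across the threshold, which is how a benchmark with two phases can end up using a different policy in each.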
4.5 Analysis
4.5.1 Varying the Cache Size
We vary the cache size from 1 MB to 8 MB and keep the associativity constant at
16-way. Figure 4.15 shows the MPKI of both LRU and DIP for four cache sizes: 1MB,
2MB, 4MB, and 8MB. The MPKI values are shown relative to the baseline 1MB LRU-managed
cache. The bar labeled avg represents the arithmetic mean MPKI measured over
all sixteen benchmarks. As mcf has a large value of MPKI, the average without mcf,
Figure 4.15: Comparison of LRU and DIP for different cache sizes
For benchmarks mcf, facerec, and health, DIP reduces MPKI by more than doubling the
size of the baseline 1MB cache does. DIP continues to reduce misses for most benchmarks
that benefit from increased capacity. The working sets of some benchmarks, e.g. vpr and
twolf, fit in a 2MB cache. Therefore, neither LRU nor DIP reduces MPKI when the cache
size is increased further. Overall, DIP significantly reduces average MPKI over LRU for all cache
sizes up to 4MB.
4.5.2 Bypassing Instead of Inserting at LRU Position
DIP uses BIP, which inserts most of the incoming lines in the LRU position. Only if
such a line is accessed while in the LRU position is it promoted to the MRU position.
Another reasonable design point is to bypass the incoming line instead of inserting it in the
LRU position. A DIP policy that employs a BIP which bypasses the incoming line when the
incoming line is to be placed in the LRU position is called DIP-Bypass. Figure 4.16 shows
Figure 4.16: Effect of Bypassing on DIP (the number shows the percentage of misses
bypassed by DIP-Bypass).
For all benchmarks except art, facerec, and sixtrack, DIP reduces MPKI more than
DIP-Bypass. This happens because DIP promotes the line installed in the LRU position
to the MRU position if the line is reused, thus increasing the useful lines in the non-LRU
positions. On the other hand, DIP-Bypass has the advantage of power savings as it avoids
the operation of inserting the line in the cache. The percentage of misses that are bypassed
by DIP-Bypass is shown in Figure 4.16 by a number associated with each benchmark
name. Thus, the proposed insertion policies can be used to reduce misses, cache power, or
both.
4.5.3 Impact on System Performance
To evaluate the effect of DIP on the overall processor performance, we use an
in-house execution-driven simulator based on the Alpha ISA. The relevant parameters of our
model are given in Table 5. Figure 4.17 shows the performance improvement measured in
instructions per cycle (IPC) between the baseline system and the same system with DIP.
The bar labeled gmean is the geometric mean of the individual IPC improvements seen by
each benchmark. The system with DIP outperforms the baseline by an average of 9.3%.
Figure 4.17: IPC improvement with DIP
4.5.4 Estimation of Hardware Overhead and Design Changes
The proposed insertion policies (LIP, BIP, and DIP) require negligible hardware
overhead and design changes. LIP inserts all incoming lines in the LRU position, which
can easily be implemented by not performing the update to the MRU position that occurs
on cache insertion.⁵ BIP is similar to LIP, except that it infrequently inserts an incoming
line into the MRU position. To control the rate of MRU insertion in BIP, we use a five-bit
counter (BIPCTR). BIPCTR is incremented on every cache miss. BIP inserts the incoming
line in the MRU position only if the BIPCTR is zero. Thus, BIP incurs a storage overhead
of 5 bits. DIP requires storage for the 10-bit saturating counter (PSEL). The complement-
Figure 4.18: Hardware changes for implementing DIP
Figure 4.18 shows the design changes incurred in implementing DIP. The imple-
mentation requires a total storage overhead of 15 bits (5-bit BIPCTR + 10-bit PSEL) and
negligible logic overhead. A particularly attractive aspect of DIP is that it does not require
extra bits in the tag-store entry, thus avoiding changes to the existing structure of the cache.
The absence of extra structures also means that DIP does not incur power and complexity
overheads. As DIP does not add any logic to the cache access path, the access time of the
cache remains unaffected.
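To make the insertion mechanics concrete, the following minimal sketch models a single cache set as a recency stack and counts hits under LRU, LIP, and BIP insertion. It is an illustration under assumptions, not the hardware design: the function name is hypothetical, a true LRU stack is used for simplicity, and BIPCTR is checked before it is incremented (the text does not fix this ordering).

```python
def simulate_set(trace, ways, policy, bipctr_bits=5):
    """Count hits for one cache set under 'LRU', 'LIP', or 'BIP' insertion."""
    stack = []    # recency stack: index 0 is MRU, the last element is LRU
    bipctr = 0    # wrap-around counter controlling MRU insertion in BIP
    hits = 0
    for block in trace:
        if block in stack:
            # Hit: promote the line to the MRU position under every policy.
            hits += 1
            stack.remove(block)
            stack.insert(0, block)
            continue
        # Miss: victim selection is unchanged, the LRU line is evicted.
        if len(stack) == ways:
            stack.pop()
        insert_at_mru = policy == "LRU" or (policy == "BIP" and bipctr == 0)
        if insert_at_mru:
            stack.insert(0, block)
        else:
            stack.append(block)   # LIP, and BIP most of the time
        bipctr = (bipctr + 1) % (1 << bipctr_bits)   # incremented on every miss
    return hits


if __name__ == "__main__":
    # Cyclic reference pattern over 5 blocks in a 4-way set (thrashes under LRU).
    trace = list("abcde") * 20
    for policy in ("LRU", "LIP", "BIP"):
        print(policy, simulate_set(trace, ways=4, policy=policy))
```

With a five-bit counter, one miss in every 32 is inserted at MRU; the only state beyond the baseline replacement logic is the counter itself.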
⁵ LIP, BIP, and DIP do not rely on true LRU, which makes them amenable to the LRU approximations
widely used in current on-chip caches.
4.5.5 Interaction with Prefetching
We simulate a PC-based stride prefetcher for the baseline machine to analyze the
impact of prefetching on the proposed DIP mechanism. Figure 4.19 shows the reduction
in MPKI, relative to the baseline machine with no prefetching, for three configurations:
first, the DIP mechanism without prefetching; second, the baseline LRU policy with prefetching
enabled; and finally, the DIP mechanism with prefetching enabled. To reduce the pollution
caused by prefetching, we always insert the prefetches in the LRU position as suggested by
Lin et al. [21]. Prefetches are promoted from the LRU position to the MRU position on a
hit.
For benchmarks such as art and mcf, prefetching reduces misses for the baseline
machine. However, when DIP and prefetching are combined, the reduction is more than
with either scheme standalone. For benchmarks such as equake, bzip2, and parser, DIP alone
does not affect the number of misses while prefetching can reduce the number of misses
significantly. When DIP and prefetching are combined in such scenarios, DIP retains the
effectiveness of prefetching. Thus, DIP and prefetching can both be used to reduce misses.
4.6 Related Work
This section summarizes the work that most closely relates to the techniques pro-
posed in this chapter, distinguishing our work where appropriate.
4.6.1 Alternative Cache Replacement Policies
The problem of thrashing can be mitigated with replacement schemes that are
resistant to thrashing. If the working set of an application is only slightly greater than the
available cache size, then even a naive scheme such as random replacement can have fewer
Figure 4.19: Interaction of Insertion Policy with Prefetching
significantly reduces as the size of the working set increases. For the baseline cache,
random replacement reduces MPKI for the thrashing workloads: art by 34%, mcf by 1.6%,
facerec by 14.4%, and health by 16.9%, whereas DIP reduces MPKI for art by 54%, mcf
by 17%, facerec by 36%, and health by 35%. Several proposals [67][42][54][25][64] have
looked at including frequency (reuse count) information for improving cache replacement.
4.6.2 Related Work in Hybrid Replacement
For workloads that cause thrashing with LRU, both random and frequency-based
replacement schemes have fewer misses than LRU. However, these schemes significantly
increase the misses for LRU-friendly workloads. Recent studies have investigated hybrid
replacement schemes that dynamically select from two or more competing replacement
policies. Examples of hybrid replacement schemes include Sampling-Based Adaptive
Replacement (SBAR) [63] and Adaptive Cache (AC) [77]. The problem with hybrid
replacement is that it requires tracking the replacement information for each of the competing
policies. For example, if the two policies are LRU and LFU (Least Frequently Used), then
each tag-entry in the baseline cache needs to be appended with frequency counters (≥ 5
bits each) which must be updated on each access. Also, the dynamic selection requires
extra structures (2kB for SBAR and 34kB for AC) which consume hardware and power.
DIP outperforms the best performing hybrid replacement (see Table 4.2) while obviating
the design changes, hardware overhead, power overhead, and complexity of hybrid
replacement. In fact, DIP bridges two-thirds of the gap between LRU and OPT while requiring less
than two bytes of extra hardware.
Table 4.2: Comparison of Replacement Policies
Policy              % Reduction in MPKI over LRU    Hardware Overhead
SBAR (LRU+Rand)      8.9                            2 kB
AC (LRU+Rand)        9.2                            34 kB
SBAR (LRU+LFU)      14.7                            12 kB
AC (LRU+LFU)        15.8                            44 kB
DIP                 21.3                            2 B
Belady's OPT        32.2                            N/A
4.6.3 Related Work in Paging Domain
We also analyze some of the related replacement studies from the paging domain.
Early Eviction LRU (EELRU) [69] tracks the hits obtained from each recency position for a
larger sized cache. If there are significantly more hits from the recency positions larger than
the cache size, EELRU changes the eviction point of the resident pages. For the studies
reported in [69], EELRU tracked 2.5 times as many pages as in physical memory. We
analyzed EELRU for our workloads with 2.5 times the tag-store entries. EELRU reduces
the average MPKI by 13.8% while incurring a storage overhead of 168kB, compared to
DIP which reduces average MPKI by 22% while incurring a storage overhead of 2B.
A recent proposal, Adaptive Replacement Cache (ARC) [50], maintains two lists:
a recency list and a frequency list. The recency list contains pages that were touched only once
while resident, whereas the frequency list contains pages that were touched at least twice.
ARC dynamically tunes the number of pages devoted to each list. We simulated ARC for
our workloads and found that ARC reduces the average MPKI by 5.64% while requiring
64kB storage.
4.6.4 Related Work in Cache Bypassing and Early Eviction
Several studies have investigated cache bypassing and early eviction. McFarling [48]
proposed dynamic exclusion to reduce conflict misses in a direct-mapped instruction cache.
Gonzalez et al. [24] proposed using a locality prediction table to bypass access patterns
that are likely to pollute the cache. Johnson [31] used reuse counters with a macro address
table to bypass lines with low reuse. Several proposals [81][85][82] exist for bypassing or
early eviction of lines brought by instructions with low locality. Another area of research
has been to predict the last touch to a cache line [41][43]. After the predicted last touch,
the line can either be turned off [36] or be used to store prefetched data [41].
However, locality, liveness, and last touch are a function of both the replacement
policy and the relative size of the working set to the available cache size. For example, if a
cyclic reference pattern with a working set size slightly greater than the available cache size
is applied to an LRU-managed cache, all the inserted lines will have poor locality, will be
dead as soon as they are installed, and will have their last touch at insertion. The solution
in such a case is neither to bypass all the lines nor to evict them early, but to retain some
fraction of the working set so that it provides cache hits. DIP retains some fraction of the
working set for longer than LRU, thus obtaining hits for at least those lines.
4.6.5 Related Work in Prefetching
Lin et al. [21] propose to reduce cache pollution due to aggressive prefetching by
inserting prefetched lines in the LRU position. However, their changes to the insertion
policy are geared towards solving the problem of inaccuracy of prefetchers. With their
proposal, demand lines are still inserted in the MRU position, making the cache susceptible
to thrashing by demand references. Our proposal, LIP, inserts all the incoming lines in
the LRU position, thus protecting against thrashing by targeting the fundamental locality
problem that exists in memory reference streams. As shown in Section 4.5.5, our work
can be combined with Lin et al.'s work to protect the cache from both thrashing and
prefetcher pollution.
4.7 Summary
The commonly used LRU replacement policy performs poorly for memory-intensive
workloads that reuse a working set greater than the available cache size. The LRU policy
inserts a line and evicts it before it is likely to be reused, causing a majority of the lines
in the cache to have zero reuse. In such cases, retaining some fraction of the working set
would provide hits for at least that fraction of the working set. We separate the problem
of replacement into two parts: victim selection policy and insertion policy. Victim
selection deals with which line gets evicted to install the incoming line. The insertion policy
deals with where on the replacement stack the incoming line is placed when installing it in
the cache. This chapter shows that simple changes to the insertion policy can significantly
improve the cache performance of memory-intensive workloads, and makes the following
contributions:
1. We propose the LRU Insertion Policy (LIP), which inserts all the incoming lines in the
LRU position of the recency stack. We show that LIP can protect against thrashing
and yields close to optimal hit-rate for applications with a cyclic reference pattern.
2. We propose the Bimodal Insertion Policy (BIP) as an enhancement to LIP that allows
for aging and adapting to changes in the working set of an application. BIP
infrequently inserts an incoming line in the MRU position, which allows it to respond to
changes in the working set while retaining the thrashing protection of LIP.
3. We propose a Dynamic Insertion Policy (DIP) that dynamically chooses between
BIP and traditional LRU replacement. DIP uses BIP for workloads that benefit from
BIP while retaining traditional LRU for workloads that are LRU-friendly and incur
increased misses with BIP.
4. We propose In-cache Dynamic Set Sampling (IDSS) to implement cost-effective
dynamic selection between competing policies. IDSS dedicates a small percentage of
sets in the cache to each of the two component policies and chooses the policy that
has fewer misses on the dedicated sets for the remaining follower sets. IDSS does not





Currently, processors are supported by large on-chip caches that try to provide faster
access to recently-accessed data. Unfortunately, when there is a miss at the largest on-chip
cache, instruction processing stalls after a few cycles [34], and the processing resources
remain idle for hundreds of cycles [84]. The inability to process instructions in parallel
with long-latency cache misses results in substantial performance loss. One way to reduce
this performance loss is to process the cache misses in parallel.¹ Techniques such as
non-blocking caches [39], out-of-order execution with large instruction windows, runahead
execution [20][51], and prefetching improve performance by parallelizing long-latency
memory operations. The notion of generating and servicing multiple outstanding cache misses
in parallel is called Memory Level Parallelism (MLP) [23].
5.1.1 Not All Misses are Created Equal
Servicing misses in parallel reduces the number of times the processor has to stall
due to a given number of long-latency memory accesses. However, MLP is not uniform
across all memory accesses in a program. Some misses occur in isolation (e.g., misses due
to pointer-chasing loads), whereas some misses occur in parallel with other misses (e.g.,
misses due to array accesses). The performance loss resulting from a cache miss is reduced
when multiple cache misses are serviced in parallel because the idle cycles waiting for
memory get amortized over all the concurrent misses. Isolated misses hurt performance
the most because the processor is stalled to service just a single miss. The non-uniformity
in MLP and the resultant non-uniformity in the performance impact of cache misses opens
up an opportunity for cache replacement policies that can take advantage of the variation in
MLP. Cache replacement, if made MLP-aware, can save isolated (relatively more costly)
misses instead of parallel (relatively less costly) misses.
¹ Multiple concurrent misses to the same cache block are treated as a single miss. Parallel miss refers to a
miss that is serviced while there is at least one more miss outstanding. Isolated miss refers to a miss that is
not serviced concurrently with any other miss.
Unfortunately, traditional cache replacement algorithms are not aware of the
disparity in performance loss that results from the variation in MLP among cache misses.
Traditional replacement schemes try to reduce the absolute number of misses with the
implicit assumption that reduction in misses correlates with reduction in memory related stall
cycles. However, due to the variation in MLP, the number of misses may or may not
correlate directly with the number of memory related stall cycles. We demonstrate how ignoring
MLP information in replacement decisions hurts performance with the following example.
Figure 5.1(a) shows a loop containing 11 memory references. There are no other memory
access instructions in the loop and the loop iterates many times.
Let K (K > 4) be the size of the instruction window of the processor on which the
loop is executed. Points A, B, C, D, and E each represent an interval of at least K
instructions. Between point A and point B, accesses to blocks P1, P2, P3, and P4 occur in the
instruction window at the same time. If these accesses result in multiple misses then those
misses are serviced in parallel, stalling the processor only once for the multiple parallel
misses. Similarly, accesses between point B and point C will lead to parallel misses if there
is more than one miss, stalling the processor only once for all the multiple parallel misses.
Conversely, accesses to block S1, S2, or S3 result in isolated misses and the processor will
be stalled once for each such miss. We analyze the behavior of this access stream for a
fully-associative cache that has space for four cache blocks, assuming the processor has
already executed the first iteration of the loop.
Figure 5.1: The drawback of not including MLP information in replacement decisions.
(a) An access pattern (P1 P2 P3 P4 from A to B, P4 P3 P2 P1 from B to C, then S1, S2, and S3).
A miss to an S block (S1, S2, S3) results in an isolated miss. A miss to a P block (P1, P2, P3, P4)
can be serviced in parallel with misses to other P blocks.
(b) Execution timeline for one iteration with Belady's OPT replacement.
(c) Execution timeline for one iteration with the MLP-aware policy (Misses/Iteration: 6, Stalls/Iteration: 2).
First, consider a replacement scheme which tries to minimize the absolute number
of misses, without taking MLP information into account. Belady's OPT [7] provides a
theoretical minimum for the number of misses by evicting a block that is accessed furthest
in the future. Figure 5.1(b) shows the behavior of Belady's OPT for the given access
stream. At point B, blocks P1, P2, P3, and P4 were accessed in the immediate past and will
be accessed again in the immediate future. Therefore, the cache contains blocks P1, P2, P3,
and P4 at point B. This results in hits for the next accesses to blocks P4, P3, P2, and P1, and
misses for the next accesses to blocks S1, S2, and S3. To guarantee the minimum number
of misses, Belady's OPT evicts P4 to store S1, S1 to store S2, and S2 to store S3. Since the
misses to S1, S2, and S3 are isolated misses, the processor incurs three long-latency stalls
between points C and A'. At point A, the cache contains P1, P2, P3, and S3, which results
in a miss for P4, stalling the processor one more time. Thus, for each iteration of the loop,
Belady's OPT causes four misses (S1, S2, S3, and P4) and four long-latency stalls.
Second, consider a simple MLP-aware policy, which tries to reduce the number of
isolated misses. This policy keeps in cache the blocks that lead to isolated misses (S1,
S2, S3) rather than the blocks that lead to parallel misses (P1, P2, P3, P4). Such a policy
evicts the least-recently used P-block from the cache. However, if there is no P-block in
the cache, then it evicts the least-recently used S-block. Figure 5.1(c) shows the behavior
of such an MLP-aware policy for the given access stream. The cache has space for four
blocks and the loop contains only 3 S-blocks (S1, S2, and S3). Therefore, the MLP-aware
policy never evicts an S-block at any point in the loop. After the first loop iteration, each
access to S1, S2, and S3 results in a hit. At point A, the cache contains S1, S2, S3, and P1.
From point A to B, the access to P1 hits in the cache, and the accesses to P2, P3, and P4
miss in the cache. However, these misses are serviced in parallel, therefore the processor
incurs only one long-latency stall for these three misses. The cache evicts P1 to store P2,
P2 to store P3, and P3 to store P4. So, at point B, the cache contains S1, S2, S3, and P4.
Between point B and point C, the access to block P4 hits in the cache, while accesses to
blocks P3, P2, and P1 miss in the cache. These three misses are again serviced in parallel,
which results in one long-latency stall. Thus, for each loop iteration, the MLP-aware policy
causes six misses ([P2, P3, P4] and [P3, P2, P1]) and only two long-latency stalls.
Note that Belady's OPT uses oracle information, whereas the MLP-aware scheme
uses only information that is available to the microarchitecture. Whether a miss is serviced
in parallel with other misses can easily be detected in the memory system, and the
MLP-aware replacement scheme uses this information to make replacement decisions. For the
given example, even with the benefit of an oracle, Belady's OPT incurs twice as many
long-latency stalls as a simple MLP-aware policy.² This simple example demonstrates
that it is important to incorporate MLP information into replacement decisions.
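The miss and stall counts of this example can be reproduced with a short simulation. This is a minimal sketch under stated assumptions: all names are illustrative, the cache is a four-entry fully-associative recency list, and every interval between two labeled points is modeled as one group in which any number of misses costs a single stall.

```python
# One loop iteration, split into the intervals A-B, B-C, C-D, D-E, E-A'.
GROUPS = [["P1", "P2", "P3", "P4"], ["P4", "P3", "P2", "P1"], ["S1"], ["S2"], ["S3"]]


def choose_victim(policy, cache, future):
    """Pick the block to evict; cache is ordered MRU first, LRU last."""
    if policy == "LRU":
        return cache[-1]
    if policy == "MLP":
        # Evict the least-recently used P-block; fall back to the LRU S-block.
        p_blocks = [b for b in cache if b.startswith("P")]
        return p_blocks[-1] if p_blocks else cache[-1]
    if policy == "OPT":
        # Belady: evict the block whose next use is furthest in the future.
        def next_use(b):
            return future.index(b) if b in future else len(future)
        return max(cache, key=next_use)
    raise ValueError(policy)


def simulate(policy, iterations=5, capacity=4):
    """Return a list of (misses, stalls) pairs, one per loop iteration."""
    flat = [b for _ in range(iterations) for grp in GROUPS for b in grp]
    cache, pos, per_iteration = [], 0, []
    for _ in range(iterations):
        misses = stalls = 0
        for grp in GROUPS:
            group_missed = False
            for blk in grp:
                pos += 1   # flat[pos:] now holds only the future accesses
                if blk in cache:
                    cache.remove(blk)
                else:
                    misses += 1
                    group_missed = True
                    if len(cache) == capacity:
                        cache.remove(choose_victim(policy, cache, flat[pos:]))
                cache.insert(0, blk)   # accessed block becomes MRU
            stalls += group_missed     # parallel misses share one stall
        per_iteration.append((misses, stalls))
    return per_iteration


if __name__ == "__main__":
    for policy in ("OPT", "MLP", "LRU"):
        print(policy, simulate(policy)[2])   # a steady-state iteration
```

The sketch treats stalls as a count per interval rather than modeling cycles, which is enough to show that minimizing misses and minimizing stalls are different objectives.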
5.1.2 Contributions
Based on the observation that the aim of a cache replacement policy is to reduce
memory related stalls, rather than to reduce the raw number of misses, this chapter proposes
MLP-aware cache replacement and makes the following contributions:
1. As a first step to enable MLP-aware cache replacement, we propose a run-time
algorithm that can compute MLP-based cost for in-flight misses.
2. We show that, for most benchmarks, the MLP-based cost repeats for consecutive
misses to individual cache blocks. Thus, the last-time MLP-based cost can be used
as a predictor for the next-time MLP-based cost.
3. We propose a simple replacement policy called the Linear (LIN) policy which takes
both recency and MLP-based cost into account to implement a practical MLP-aware
cache replacement scheme. Evaluation with the SPEC CPU2000 benchmarks shows
performance improvement of up to 23% with the LIN policy.
4. The LIN policy does not perform well for benchmarks in which the MLP-based cost
differs significantly for consecutive misses to an individual cache block. We propose
a cost-effective hybrid replacement policy to select between LIN and LRU,
depending on which policy results in the fewest memory related stall cycles.
² We use Belady's OPT in the example only to emphasize that the concepts of reducing the number of misses
and making the replacement scheme MLP-aware are independent. However, Belady's OPT is impossible
to implement because it requires knowledge of the future. Therefore, we will use LRU as the baseline
replacement policy for the remainder of this chapter. For the LRU policy, each iteration of the loop shown in
Figure 5.1 causes six misses ([P2, P3, P4], S1, S2, S3) and four long-latency stalls.
5.2 Background
Out-of-order execution inherently improves MLP by continuing to execute instruc-
tions after a long-latency miss. Instruction processing stop only when the instruction win-
dow becomes full. If additional misses are encountered before the window becomes full,
then these misses are serviced in parallel with the stallingmiss. The analytical model of
out-of-order superscalar processors proposed by Karkhanis and Smith [35] provides fun-
damental insight into how parallelism in L2 misses can reducthe cycles per instruction
incurred due to L2 misses.
The effectiveness of an out-of-order engine’s ability to increase MLP is limited by
the instruction window size. Several proposals [51][2][188 ] have looked at the problem
of scaling the instruction window for out-of-order processors. Chou et al. [16] analyzed
the effectiveness of different microarchitectural techniques such as out-of-order execution,
value prediction, and runahead execution on increasing MLP. They concluded that mi-
croarchitecture optimizations can have a profound impact on increasing MLP. They also
formally defined instantaneous MLP asthe number of useful long-latency off-chip ac-
cesses outstanding when there is at least one such access outstanding. MLP can also be
improved at the compiler level. Read miss clustering [55] isa compiler technique in which
the compiler reorders load instructions with predictable access patterns to improve memory
parallelism.
All of the techniques described thus far try to improve MLP by overlapping long-latency
memory operations. However, MLP is not uniform across all memory accesses in a program.
While some of the misses are parallelized, many misses still occur in isolation.
It makes sense to make this variation in MLP visible to the cache replacement algorithm.
Cache replacement, if made MLP-aware, can increase performance by reducing the number
of isolated misses at the expense of parallel misses. To our knowledge no previous
research has looked at including MLP information in replacement decisions. Srinivasan et
al. [75][74] analyzed the criticality of load misses for out-of-order processors. However,
criticality and MLP are two different properties. Criticality, as defined in [75], is determined by
how long instruction processing continues after a load miss, whereas MLP is determined
by how many additional misses are encountered while servicing a miss.
Cost-sensitive replacement policies for on-chip caches were investigated by Jeong
and Dubois [29][30]. They proposed variations of LRU that take cost (any numerical property
associated with a cache block) into account. In general, any cost-sensitive replacement
scheme, including the ones proposed in [30], can be used for implementing an MLP-aware
replacement policy. However, to use any cost-sensitive replacement scheme, we first need
to define the cost of each cache block based on the MLP with which it was serviced. As
the first step to enable MLP-aware cache replacement, we introduce a run-time technique
to compute MLP-based cost.
5.3 Computing MLP-Based Cost
For current instruction window sizes, instruction processing stalls shortly after a
long-latency miss occurs. The number of cycles for which a miss stalls the processor can
be approximated by the number of cycles that the miss spends waiting to get serviced. For
parallel misses, the stall cycles can be divided equally among all concurrent misses.
5.3.1 Algorithm
The information about the number of in-flight misses and the number of cycles a
miss is waiting to get serviced can easily be tracked by the MSHR (Miss Status Holding
Register). Each miss is allocated an MSHR entry before a request to service that miss is
sent to memory [39]. To compute the MLP-based cost, we add a field, mlp_cost, to each
MSHR entry. Algorithm 1 describes the calculation of the MLP-based cost of a cache miss.
Algorithm 1 Calculate MLP-based cost for cache misses
  init_mlp_cost(miss):    /* when miss enters MSHR */
      miss.mlp_cost = 0

  update_mlp_cost():      /* called every cycle */
      N ⇐ Number of outstanding demand misses in MSHR
      for each demand miss in the MSHR
          miss.mlp_cost += (1/N)
When a miss is allocated an MSHR entry, the mlp_cost field associated with that
entry is initialized to 0. We count instruction accesses, load accesses, and store accesses
that miss in the largest on-chip cache as demand misses. All misses are treated as being on the
correct path until they are confirmed to be on the wrong path. Misses on the wrong path are not
counted as demand misses. Each cycle, the mlp_cost of all demand misses in the MSHR is
incremented by the amount 1/(Number of outstanding demand misses in MSHR) (see footnotes 3 and 4). When
a miss is serviced, the mlp_cost field in the MSHR represents the MLP-based cost of that
miss. Henceforth, we will use mlp-cost to denote MLP-based cost.
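The per-cycle bookkeeping of Algorithm 1 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware implementation: the names MshrEntry, Mshr, allocate, tick, and service are hypothetical, and exact fractions are used in place of the fixed-point adders discussed in the footnotes.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class MshrEntry:
    """One outstanding miss (hypothetical software model of an MSHR entry)."""
    block_addr: int
    is_demand: bool = True
    mlp_cost: Fraction = Fraction(0)  # init_mlp_cost: starts at 0


class Mshr:
    def __init__(self):
        self.entries = {}

    def allocate(self, block_addr, is_demand=True):
        # Called when a miss enters the MSHR.
        self.entries[block_addr] = MshrEntry(block_addr, is_demand)

    def tick(self):
        # update_mlp_cost: called once per cycle.
        demand = [e for e in self.entries.values() if e.is_demand]
        if not demand:
            return
        share = Fraction(1, len(demand))  # 1/N
        for entry in demand:
            entry.mlp_cost += share

    def service(self, block_addr):
        # When the miss is serviced, its accumulated mlp_cost is final.
        return self.entries.pop(block_addr).mlp_cost


if __name__ == "__main__":
    mshr = Mshr()
    mshr.allocate(0x40)
    mshr.allocate(0x80)
    for _ in range(444):
        mshr.tick()
    print(mshr.service(0x40), mshr.service(0x80))
```

In this model, an isolated miss that stays in the MSHR for 444 cycles accumulates a cost of 444, while two misses that overlap for the same 444 cycles accumulate 222 each.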
5.3.2 Distribution of mlp-cost
Figure 5.2 shows the distribution of mlp-cost for 14 SPEC benchmarks measured
on an eight-wide issue, out-of-order processor with a 128-entry instruction window. An
isolated miss takes 444 cycles (400-cycle bank access + 44-cycle bus delay) to get serviced.
The vertical axis represents the percentage of all misses and the horizontal axis corresponds
to different values of mlp-cost. The graph is plotted with 60-cycle intervals, with the
leftmost bar representing the percentage of misses that had a value of 0 ≤ mlp-cost < 60
cycles. The rightmost bar represents the percentage of all misses that had an mlp-cost of
more than 420 cycles. All isolated misses (and some parallel misses that are serialized
because of DRAM bank conflicts) are accounted for in the rightmost bar.

Figure 5.2: Distribution of mlp-cost. The horizontal axis represents the value of mlp-cost in cycles
and the vertical axis represents the percentage of total misses. The dot on the horizontal axis represents the
average value of mlp-cost.

3 The number of adders required for the proposed algorithm is equal to the number of MSHR entries.
However, for the baseline machine with 32 MSHR entries, time-sharing four adders among the 32 entries has
only a negligible effect on the absolute value of the MLP-based cost. For all our experiments, we assume that
the MSHR contains only four adders for calculating the MLP-based cost. If more than four MSHR entries
are valid, then the adders are time-shared between all the valid entries using a simple round-robin scheme.
4 We also experimented with increasing the mlp_cost only during cycles when there is a full window stall.
However, we did not find any significant difference in the relative value of mlp_cost or the performance
improvement provided by our proposed replacement scheme. Therefore, for simplicity, we assume that the
For each benchmark, the average value of mlp-cost is much less than 444 cycles
(the number of cycles needed to serve an isolated miss). For art, more than 85% of the misses
have an mlp-cost of less than 120 cycles, indicating high parallelism in misses. For mcf,
about 40% of the misses have an mlp-cost between 180 and 240 cycles, which corresponds
to two misses in parallel. Mcf also has about 9% of its misses as isolated misses. Facerec
has two distinct peaks, one for the misses that occur in isolation and the other for the
misses that occur with a parallelism of two. Twolf, vpr, facerec, and parser have a high
percentage of isolated misses and hence the peak for the rightmost bar. The results for all
of these benchmarks clearly indicate that there exists non-uniformity in mlp-cost which
can be exploited by MLP-aware cache replacement. The objective of MLP-aware cache
replacement is to reduce the number of isolated (i.e., relatively more costly) misses without
substantially increasing the total number of misses. mlp-cost can serve as a useful metric
in designing an MLP-aware replacement scheme. However, for the decision based on mlp-cost
to be meaningful, we need a mechanism to predict the future mlp-cost of a miss
given the current mlp-cost of a miss. For example, a miss that happens in isolation once
can happen in parallel with other misses the next time, leading to significant variation in
the mlp-cost for the miss. If mlp-cost is not predictable for a cache block, the information
provided by the mlp-cost metric is not useful. The next section examines the predictability
of mlp-cost.
5.3.3 Predictability of the mlp-cost metric
One way to predict the future mlp-cost value of a block is to use the current mlp-cost
value of that block. The usefulness of this scheme can be evaluated by measuring
the difference between the mlp-cost for successive misses to a cache block. We call the
absolute difference in the value of mlp-cost for successive misses to a cache block delta.
For example, let cache block A have mlp-cost values of {444 cycles, 80 cycles, 80 cycles,
220 cycles} for the four misses it had in the program. Then, the first delta for block A is
364 (|444 − 80|) cycles, the second delta for block A is 0 (|80 − 80|) cycles, and the third
delta for block A is 140 (|80 − 220|) cycles. To measure delta, we do an off-line analysis
of all the misses in the program. Table 5.1 shows the distribution of delta. A small delta
value means that mlp-cost does not significantly change between successive misses to a
given cache block.
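The delta computation in the example above can be written out as a short sketch. The function names deltas and delta_bucket are hypothetical, and the three buckets simply mirror the column boundaries used in Table 5.1.

```python
def deltas(mlp_costs):
    """Absolute difference in mlp-cost between successive misses to one block."""
    return [abs(prev - cur) for prev, cur in zip(mlp_costs, mlp_costs[1:])]


def delta_bucket(delta):
    """Classify a delta into the three ranges reported in Table 5.1."""
    if delta < 60:
        return "delta < 60"
    if delta < 120:
        return "60 <= delta < 120"
    return "delta >= 120"


if __name__ == "__main__":
    block_a = [444, 80, 80, 220]  # mlp-cost values of block A from the text
    for d in deltas(block_a):
        print(d, delta_bucket(d))
```

For block A this yields the deltas 364, 0, and 140 cycles given in the text.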
Table 5.1: Repeatability of mlp-cost. The first three columns after the benchmark name represent the
percentage of deltas that were between 0-59 cycles, between 60-119 cycles, and 120 cycles or more, respectively.
The last column represents the average value of delta (in cycles).

Benchmark   delta < 60   60 ≤ delta < 120   delta ≥ 120   Average delta
art            86%             7%                7%             30
mcf            86%             7%                7%             21
twolf          52%            12%               36%             98
vpr            50%            14%               36%            100
facerec        96%             0%                4%              9
ammp           82%            10%                8%             32
galgel         71%             9%               20%             82
equake         78%            12%               10%             40
bzip2          43%            15%               42%            126
parser         43%             5%               52%            190
apsi           85%             5%               10%             28
sixtrack      100%             0%                0%              1
lucas          84%             6%               10%             52
mgrid          18%            16%               66%            187
swim           80%            16%                4%             41
For all the benchmarks, except bzip2, parser, and mgrid, the majority of the delta
values are less than 60 cycles. The average delta value is also fairly low, which means
that the next-time mlp-cost for a cache block remains fairly close to the current mlp-cost.
Thus, the current mlp-cost can be used as a predictor of the next mlp-cost of the same
block in MLP-aware cache replacement. We describe our experimental methodology before
discussing the design and implementation of an MLP-aware cache replacement scheme
based on these observations.
5.4 The Design of an MLP-Aware Cache Replacement Scheme
Figure 5.3 shows the microarchitecture design for MLP-aware cache replacement.
The added structures are shaded. The cost calculation logic (CCL) contains the hardware
implementation of Algorithm 1. It computes mlp-cost for all demand misses. When a miss
gets serviced, the mlp-cost of the miss is stored in the tag-store entry of the corresponding
cache block. For replacement, the cache invokes the Cost Aware Replacement Engine
(CARE) to find the replacement victim. CARE can consist of any generic cost-sensitive
scheme [30][52]. We evaluate MLP-aware cache replacement using both an existing as

Figure 5.3: Microarchitecture for MLP-aware cache replacement (Figure not to scale).
Before discussing the details of the MLP-aware replacement scheme, it is useful
to note that the exact value of mlp-cost is not necessary for replacement decisions. In a
real implementation, to limit the storage overhead, the value of mlp-cost can be quantized
to a few bits and the quantized value would be stored in the tag-store. We consider one
such quantization scheme. It converts the value of mlp-cost into a 3-bit quantized value,
according to the intervals shown in Table 5.2. Henceforth, we use costq to denote the
quantized value of mlp-cost.
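A quantizer of this kind can be sketched as follows. The interval boundaries are an assumption made for illustration only: the sketch uses uniform 60-cycle intervals, matching the bin width of the histogram in Figure 5.2, and saturates at the largest 3-bit value. The intervals actually used are the ones defined in Table 5.2, and the function name quantize_mlp_cost is hypothetical.

```python
def quantize_mlp_cost(mlp_cost, interval=60, bits=3):
    """Map an mlp-cost in cycles to a small saturating integer.

    Assumption: uniform intervals of `interval` cycles; values beyond the
    last interval saturate at 2**bits - 1.
    """
    max_q = (1 << bits) - 1
    return min(int(mlp_cost // interval), max_q)


if __name__ == "__main__":
    for cost in (0, 59, 60, 222, 419, 420, 444):
        print(cost, quantize_mlp_cost(cost))
```

Under this assumption, an isolated miss (444 cycles) maps to the maximum value 7 and two fully overlapped misses (222 cycles each) map to 3.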
Table 5.2: Quantization of mlp-cost
5.4.1 The Linear (LIN) Policy
The baseline replacement policy is LRU. The replacement function of LRU selects
the candidate cache block with the least recency. Let Victim_LRU be the victim selected
by LRU and R(i) be the recency value (highest value denotes the MRU and lowest value
denotes LRU) of block i. Then, the victim of the LRU policy can be written as:

Victim_{LRU} = \arg\min_{i} \{ R(i) \}    (5.1)
We want a policy that takes into account both costq and recency. We propose a
replacement policy that employs a linear function of recency and costq. We call this policy
the Linear (LIN) policy. The replacement function of LIN can be summarized as follows:
Let Victim_LIN be the victim selected by the LIN policy, R(i) be the recency value of block
i, and costq(i) be the quantized cost of block i. Then the victim of the LIN policy can be
written as:

Victim_{LIN} = \arg\min_{i} \{ R(i) + \lambda \cdot costq(i) \}    (5.2)
The parameter λ determines the importance of costq in choosing the replacement
victim. In case of a tie for the minimum value of {R + λ · costq}, the candidate with
the smallest recency value is selected. Note that LRU is a special case of the LIN policy
with λ = 0. With a high λ value, the LIN policy tries to retain recent cache blocks that
have high mlp-cost. For our experiments, we used the position in the LRU stack as the
recency value (e.g., for a 16-way cache, R(MRU) = 15 and R(LRU) = 0). Since costq is
quantized into three bits, its range is from 0 to 7. Unless stated otherwise, we use λ = 4 in
all our experiments.
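The victim selection of Equation 5.2, including the tie-breaking rule, can be sketched for a single set as follows. The representation of a set as a list of (recency, costq) pairs and the function name lin_victim are assumptions made for this illustration.

```python
def lin_victim(blocks, lam=4):
    """Return the index of the LIN victim in one cache set.

    blocks: list of (recency, costq) pairs, where recency is the position in
            the LRU stack (0 = LRU) and costq is the quantized cost (0..7).
    Ties on recency + lam * costq go to the block with the smallest recency.
    """
    return min(
        range(len(blocks)),
        key=lambda i: (blocks[i][0] + lam * blocks[i][1], blocks[i][0]),
    )


if __name__ == "__main__":
    # Four-way set: the LRU block was an isolated (costly) miss.
    ways = [(0, 7), (1, 0), (2, 0), (3, 0)]
    print(lin_victim(ways, lam=4))  # LIN protects the costly LRU block
    print(lin_victim(ways, lam=0))  # lam = 0 degenerates to plain LRU
```

With λ = 4, the costly block at the LRU position scores 28 and is retained, so the block at recency 1 is evicted instead. With λ = 0 the function picks the LRU block, as stated in the text.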
5.4.2 Results for the LIN Policy
Figure 5.4 shows the performance impact of the LIN policy for different values of
λ. The effect of the LIN policy is more pronounced as the value of λ is increased from
1 to 4. With λ = 4, the LIN policy provides a significant IPC improvement for art, mcf,
vpr, galgel, and sixtrack. In contrast, it degrades performance for bzip2, parser, and mgrid.
These benchmarks have high average delta values (refer to Table 5.1), so replacement
decisions based on mlp-cost hurt performance. LIN can improve performance by reducing
the number of isolated misses, or by reducing the total number of misses, or both. We
analyze the LIN policy further by comparing the mlp-cost distribution of the LIN policy
with the mlp-cost distribution of the baseline.
Figure 5.5 shows the mlp-cost distribution for both the baseline and the LIN policy.
The inset contains information about the change in the number of misses and the change in
IPC due to LIN. For mcf, almost all the isolated misses are eliminated by LIN. For twolf,
although the total number of misses increases by 7%, IPC increases by 1.5%. A similar
trend of increase in misses accompanied by increase in IPC is observed for ammp and
equake. For these benchmarks, the IPC improvement is coming from reducing the number

Figure 5.4: IPC improvement with LIN (λ) as λ is varied.
of misses. For all benchmarks, except art and galgel, the distribution of mlp-cost is skewed
towards the left (i.e., lower mlp-cost) for the LIN policy when compared to the baseline.
This indicates that LIN successfully biases replacement towards reducing the proportion of high
mlp-cost misses.
For art, galgel, and sixtrack, LIN reduces the total number of misses by more than
30%. This happens for applications that have very large data working sets with low temporal
locality, causing LRU to perform poorly [56][85]. The LIN policy automatically
provides filtering for access streams with low temporal locality by at least keeping some
of the high mlp-cost blocks in the cache, when LRU could have potentially caused thrashing.
The large reduction in the number of misses for art and galgel reduces the parallelism
with which the remaining misses get serviced. Hence, for both art and galgel, the average

Figure 5.5: Distribution of mlp-cost for baseline and LIN (λ = 4). The horizontal axis
represents the value of mlp-cost in cycles and the vertical axis represents the percentage of all misses. The dot
on the horizontal axis represents the average value of mlp-cost. The insets in the graphs contain information
about the change in the number of misses and IPC with the use of the LIN policy.
The LIN policy tries to retain recent cache blocks that have high mlp-cost values.
The implicit assumption is that the blocks that had high mlp-cost at the time they were
brought into the cache will continue to have high mlp-cost the next time they need to be
fetched. Therefore, the LIN policy performs poorly for benchmarks in which the current mlp-cost
is not a good indicator of the next-time mlp-cost. Examples of such benchmarks are
bzip2 (average delta = 126 cycles), parser (average delta = 190 cycles), and mgrid (average
delta = 187 cycles). For these benchmarks, the number of misses increases significantly
with the LIN policy. For the LIN policy to be useful for a wide variety of applications,
we need a feedback mechanism that can limit the performance degradation caused by LIN.
This can be done by dynamically choosing between the baseline LRU policy and the LIN
policy depending on which policy is doing better. The next section presents a novel, low-overhead
adaptation scheme that provides such a capability.
5.5 Cost-Sensitive Hybrid Replacement
LIN performs better on some benchmarks and LRU performs better on others.
We want a mechanism that can dynamically choose the replacement policy that provides
higher performance, or equivalently fewer memory-related stall cycles. The SBAR
policy proposed in Chapter 3 can be modified to choose the component policy that incurs
the minimum aggregate cost of misses. The next section describes cost-sensitive tournament
selection, which can be used with SBAR to select between LIN and LRU.
5.5.1 Cost-Sensitive Tournament Selection of Replacement Policy
Let MTD be the main tag directory of the cache. To facilitate hybrid replacement,
MTD is capable of implementing both LIN and LRU. MTD is appended with two
Auxiliary Tag Directories (ATDs): ATD-LIN and ATD-LRU. ATD-LIN implements only
the LIN policy, and ATD-LRU implements only the LRU policy. A saturating counter
(PSEL) keeps track of which of the two ATDs is doing better. The access stream visible to
MTD is also fed to both ATD-LIN and ATD-LRU. Both ATD-LIN and ATD-LRU compete
and the output of PSEL is an indicator of which policy is doing better. The replacement
policy to be used in MTD is chosen based on the output of PSEL. Figure 5.6 shows the
operation of the cost-sensitive tournament selection (TSEL) mechanism for one set in the cache.
If a given access hits or misses in both ATD-LIN and ATD-LRU, neither policy
is doing better than the other. Thus, PSEL remains unchanged. If an access misses in
ATD-LIN but hits in ATD-LRU, LRU is doing better than LIN for that access. In this case,
PSEL is decremented by a value equal to the costq of the miss (a 3-bit value) incurred
by ATD-LIN. Conversely, if an access misses in ATD-LRU but hits in ATD-LIN, LIN is
doing better than LRU. Therefore, PSEL is incremented by a value equal to the costq of
the miss incurred by ATD-LRU. Unless stated otherwise, we use a 6-bit PSEL counter in

[Figure 5.6 labels: Decrement PSEL by cost of Miss in ATD-LIN; Increment PSEL by cost of Miss in ATD-LRU;
MTD uses LIN, else MTD uses LRU.]
Figure 5.6: Cost-sensitive Tournament Selection for a single set.
Only accesses that result in a miss for MTD are serviced by the memory system.
If an access results in a hit for MTD but a miss for either ATD-LIN or ATD-LRU, then
it is not serviced by the memory system. Instead, the ATD that incurred the miss finds
a replacement victim using its replacement policy. The tag field associated with the
replacement victim of the ATD is updated. The value of costq associated with the block is
obtained from the corresponding tag-directory entry in MTD.
If LIN reduces memory-related stall cycles more than LRU, then PSEL will be
saturated towards its maximum value. Similarly, PSEL will be saturated towards zero if
the opposite is true. If the most significant bit (MSB) of PSEL is 1, the output of PSEL
indicates that LIN is doing better. Otherwise, the output of PSEL indicates that LRU is
doing better. Note that PSEL is incremented or decremented by costq instead of by 1,
which results in selection based on the cumulative value of MLP-based cost of misses (i.e.,
Σ costq), rather than the raw number of misses. This is an important factor in the TSEL
mechanism that allows TSEL to select the policy that results in the smallest number of
stall cycles, rather than the smallest number of misses. If the value of costq is constant or
random, then the adaptation mechanism automatically degenerates to selecting the policy
that results in the smallest number of misses.
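The PSEL update rules described above can be sketched as a saturating counter. The class name Psel and its methods are hypothetical, and the initial counter value (the midpoint) is an assumption of this sketch, since the text does not specify it.

```python
class Psel:
    """Cost-sensitive saturating policy selector (illustrative sketch)."""

    def __init__(self, bits=6):
        self.max_value = (1 << bits) - 1
        self.msb_mask = 1 << (bits - 1)
        self.value = self.msb_mask  # assumption: start at the midpoint

    def update(self, hit_in_atd_lin, hit_in_atd_lru, costq):
        if hit_in_atd_lin == hit_in_atd_lru:
            return  # both hit or both miss: neither policy is doing better
        if hit_in_atd_lru:
            # Miss only in ATD-LIN: LRU did better, decrement by costq.
            self.value = max(0, self.value - costq)
        else:
            # Miss only in ATD-LRU: LIN did better, increment by costq.
            self.value = min(self.max_value, self.value + costq)

    def use_lin(self):
        # MSB set means LIN is doing better; otherwise use LRU.
        return bool(self.value & self.msb_mask)


if __name__ == "__main__":
    psel = Psel()
    psel.update(hit_in_atd_lin=False, hit_in_atd_lru=True, costq=7)
    print(psel.value, psel.use_lin())
```

Because the counter moves by costq rather than by 1, a single isolated miss (costq = 7) outweighs several cheap parallel misses, which is what steers the selection towards fewer stall cycles rather than fewer misses.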
5.5.2 Sampling Based Adaptive Replacement
The cost-sensitive tournament selection can be extended to the whole cache by using
the SBAR mechanism of Chapter 3. Figure 5.7 shows the cost-sensitive SBAR selection
between LIN and LRU for a cache containing eight sets. The sets in MTD are logically
divided into two categories: Leader Sets and Follower Sets. The leader sets in MTD use only
the LIN policy for replacement and participate in updating the PSEL counter. The follower
sets implement both the LIN and the LRU policies for replacement and use the PSEL output
to choose their replacement policy. The follower sets do not update the PSEL counter.
There is only a single ATD, ATD-LRU. ATD-LRU implements only the LRU policy and
has only sets corresponding to the leader sets. The leader sets are chosen using the simple

[Figure 5.7 legend for MTD: Leader Sets (have ATD-LRU entries, always follow LIN policy); Follower Sets.]
Figure 5.7: Cost-sensitive SBAR selection of LIN vs. LRU for a cache that has eight sets.
5.5.3 Results for the SBAR Mechanism
Figure 5.8 shows the IPC improvement over the baseline configuration when the
SBAR mechanism is used to dynamically choose between LRU and LIN. For comparison,

Figure 5.8: IPC improvement with the SBAR mechanism.
For art, mcf, vpr, facerec, sixtrack, and apsi, SBAR maintains the IPC improvement
provided by LIN. The most important contribution of SBAR is that it eliminates the
performance degradation caused by LIN on bzip2, parser, mgrid, and swim. For these
benchmarks, the PSEL in the SBAR mechanism is almost always biased towards LRU.
The marginal performance loss in these benchmarks is because the leader sets in
MTD still use only LIN as their replacement policy. For ammp and galgel, the SBAR policy
does better than either LIN or LRU alone. This happens because in some phases of the
program LIN does better, while in others LRU does better. With SBAR, the cache is able
to select the policy better suited for each phase, thereby allowing it to outperform either
policy implemented alone. In Section 5.6.1, we analyze the ability of SBAR to adapt to
varying program phases using ammp as a case study.
5.5.4 Effect of Leader Set Selection Policies and Different Number of Leader Sets
To analyze the effect of leader set selection policies, we introduce a runtime policy,
rand-runtime. Rand-runtime randomly selects one set from each constituency as the leader
set. In our experiments, we invoke rand-runtime once every 25M instructions and mark the
sets chosen by rand-runtime as leader sets for the next 25M instructions. Figure 5.9 shows
the performance improvement for the SBAR policy with the simple-static policy and the

[Figure 5.9 panels: (a) simple-static; 8 leader sets. (b) rand-runtime; 8 leader sets. (c) simple-static; 16 leader sets.
(d) rand-runtime; 16 leader sets. (e) simple-static; 32 leader sets.]
Figure 5.9: Performance impact of SBAR for different leader set selection policies and
different number of leader sets.
For all benchmarks, except ammp, the IPC improvement of SBAR is relatively insensitive
to both the leader set selection policy and the number of leader sets. In most
benchmarks, one replacement policy does overwhelmingly better than the other. This
causes almost all the sets in the cache to favor one policy. Hence, even as few as eight
leader sets are sufficient, and the simple-static policy works well. For ammp, the rand-runtime
policy performs better than the simple-static policy when the number of leader sets
is 16 or smaller. This is because ammp has widely varying demand across different cache
sets, which is better handled by the random selection of the rand-runtime policy than the
rigid static selection of the simple-static policy. However, when the number of leader sets
increases to 32, the effect of the set selection policy is less pronounced, and there is hardly
any performance difference between the two set selection policies. Due to its simplicity,
we use the simple-static policy with 32 leader sets as the default in all our SBAR experiments.
We also compared SBAR to TSEL-global and found that, except for ammp, the IPC
increase provided by SBAR is within 1% of the TSEL-global policy (we use a seven-bit
PSEL for TSEL-global). For ammp, TSEL-global improves IPC by 20.3% while SBAR
improves IPC by 18.3%. However, SBAR requires 64 times fewer ATD entries than TSEL-global,
making it a much more practical solution.
5.6 Analysis
5.6.1 Ammp: A Case Study for Dynamic Adaptation of SBAR
For ammp, SBAR improves IPC by 18.3% over the baseline LRU policy while the
LIN policy improves IPC by only 4.2%. This difference in IPC improvement between
SBAR and LIN is because ammp has two distinct phases: in one phase LIN performs
better than LRU and in the other LRU performs better than LIN. To view this time-varying
phase behavior, we collected statistics from the cache every 10M retired instructions during
simulation. Figure 5.10(a) shows the average costq per miss, Figure 5.10(b) shows the
misses per 1000 retired instructions, and Figure 5.10(c) shows the IPC for three different
policies: LRU, LIN, and SBAR over time during the simulation runs.
Figure 5.10: Comparison of LRU, LIN, and SBAR for the ammp benchmark in terms of:
(a) the average cost of misses, (b) the number of misses per 1000 instructions, and (c) IPC.
As expected, LIN results in lower costq per miss than LRU throughout the whole
simulation, indicating that the LIN policy is successful at reducing the costq of misses.
However, this reduction can come at the expense of significantly increasing the raw number
of misses, which may negatively impact the IPC. Until 150M instructions, this is not a
problem: LIN has both lower costq per miss and fewer misses than LRU. Therefore, the
IPC with LIN is much better than the IPC with LRU for the first 150M instructions. However,
after 150M instructions, LIN has significantly more misses than LRU, which reduces
the IPC for the LIN policy compared to LRU. With SBAR, the cache dynamically adapts
and uses the policy that is best suited for each phase: LIN until 150M instructions and LRU
after 150M instructions. Therefore, SBAR provides higher performance than both LIN and
LRU.
5.6.2 Hardware Cost of MLP-Aware Replacement
The performance improvement of MLP-aware replacement comes at a small hardware
overhead. For each entry in the MSHR, additional bits are required to store mlp-cost.
We assume that each MSHR entry stores the mlp-cost in a 9.5 fixed-point format,
where 9 bits are used to encode the integer part and 5 bits are used to encode the fractional
part; thus, 14 bits are required per MSHR entry to compute the mlp-cost. Also, costq
is stored in each tag-store entry in the cache, increasing the size of each tag-store entry by
three bits. If SBAR is used to adaptively choose between LRU and LIN, then additional
storage is required for the ATD entries. The hardware overhead of the ATD is 1856 bytes,
which is less than 0.2% of the total area of the baseline L2 cache.
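The storage figures above can be checked with a short calculation. The MSHR size (32 entries), the cache organization (1024 sets, 16 ways), and the default of 32 leader sets are taken from elsewhere in this chapter. The composition of an ATD entry (24-bit tag, 1 valid bit, 4 LRU bits, and no cost bits) and the 64-byte line size are assumptions of this sketch, chosen because they are consistent with the 1856-byte figure, and the ATD is compared against the data array only rather than the total cache area.

```python
# Parameters taken from the text.
MSHR_ENTRIES = 32            # baseline MSHR size (footnote 3)
MLP_COST_BITS = 9 + 5        # 9.5 fixed-point format
SETS, WAYS = 1024, 16        # baseline L2 organization (Section 5.6.3)
COSTQ_BITS = 3               # quantized cost per tag-store entry
LEADER_SETS = 32             # default number of leader sets (Section 5.5.4)

# Assumptions made for this sketch only.
ATD_ENTRY_BITS = 24 + 1 + 4  # tag + valid + LRU bits, no cost bits
LINE_BYTES = 64              # yields a 1 MB data array

mshr_bytes = MSHR_ENTRIES * MLP_COST_BITS // 8
tag_store_bytes = SETS * WAYS * COSTQ_BITS // 8
atd_bytes = LEADER_SETS * WAYS * ATD_ENTRY_BITS // 8
data_bytes = SETS * WAYS * LINE_BYTES
atd_fraction = atd_bytes / data_bytes

print(mshr_bytes, tag_store_bytes, atd_bytes, f"{atd_fraction:.2%}")
```

Under these assumptions the ATD occupies 32 × 16 × 29 bits = 1856 bytes, which is about 0.18% of a 1 MB data array.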
5.6.3 MLP-Aware Replacement using Existing Cost-Sensitive Replacement Policy
We proposed the SBAR mechanism to implement an MLP-aware cache replacement
policy. However, the central idea of this chapter, MLP-aware cache replacement, is not limited
in implementation to the proposed SBAR mechanism. Our framework for MLP-aware
cache replacement makes even existing cost-sensitive replacement policies applicable to the
MLP domain. As an example, we use Adaptive Cost-Sensitive LRU (ACL) [30] to implement
an MLP-aware replacement policy. ACL was proposed for cost-sensitive replacement
in Non-Uniform Memory Access (NUMA) systems and used the memory access latency as
the cost parameter. Similarly, MLP information about a cache block can also be used as the
cost parameter in ACL. Figure 5.11 shows the performance improvement of an MLP-aware

[Figure 5.11 legend: MLP-aware replacement using SBAR.]
Figure 5.11: MLP-aware replacement using different cost-sensitive policies.
MLP-aware replacement improves performance for both implementations, ACL
and SBAR, indicating that MLP-aware replacement works with both existing (ACL) and
proposed (SBAR) cost-sensitive policies. However, SBAR has higher performance and substantially
lower hardware overhead than ACL, which makes SBAR a much more favorable
candidate for implementing MLP-aware cache replacement. The cost-sensitive policy employed
by ACL requires a shadow directory on a per-set basis. For the baseline 16-way
cache, ACL needs a 15-way shadow directory [30]. Assuming a 40-bit physical address
space, each entry in the shadow directory needs four bytes of storage (24-bit tag + 1 valid
bit + 4 LRU bits + 3 cost bits = 4B). Thus, the total overhead of the shadow directory is
60 kB (4B/entry * 15 entries/set * 1024 sets = 60 kB). Comparatively, the overhead of SBAR
is only 1856B (see Table 4), which is 33 times smaller than the overhead of ACL. Because
ACL requires shadow directory information on a per-set basis, it is not straightforward to
use dynamic set sampling to reduce the storage overhead of ACL.
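The comparison between the two overheads can be reproduced with the numbers quoted above. This is a small arithmetic sketch; the variable names are hypothetical.

```python
# Shadow-directory entry for ACL, as itemized in the text.
ENTRY_BITS = 24 + 1 + 4 + 3          # tag + valid + LRU + cost bits
ENTRY_BYTES = ENTRY_BITS // 8        # 4 bytes per entry
SHADOW_WAYS = 15                     # 15-way shadow directory per set
SETS = 1024

acl_bytes = ENTRY_BYTES * SHADOW_WAYS * SETS
sbar_bytes = 1856                    # ATD overhead of SBAR quoted in the text
ratio = acl_bytes / sbar_bytes

print(acl_bytes, acl_bytes // 1024, round(ratio, 1))
```

The shadow directory comes to 61440 bytes (60 kB), roughly 33 times the 1856 bytes needed by SBAR.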
5.7 Summary
Memory Level Parallelism (MLP) varies across different misses of an application,
causing some misses to be more costly on performance than others. The non-uniformity in
the performance impact of cache misses can be exposed to the cache replacement policy so
that it can improve performance by reducing the number of costly misses. Based on this
observation, we propose MLP-aware cache replacement. We present a run-time technique
to compute the MLP-based cost for each cache block. This cost metric is used to drive
cost-sensitive cache replacement policies. We also extend the Sampling Based Adaptive
Replacement (SBAR) to dynamically choose between an MLP-aware replacement policy
(LIN) and a traditional (LRU) replacement policy, depending on which one is providing
fewer memory-related stalls.
Chapter 6
Utility Based Partitioning of Shared Caches
This chapter investigates the problem of partitioning a shared cache between multiple
concurrently executing applications. The commonly used LRU policy implicitly partitions
a shared cache on a demand basis, giving more cache resources to the application
that has a high demand and fewer cache resources to the application that has a low demand.
However, a higher demand for cache resources does not always correlate with a higher
performance from additional cache resources. It is beneficial for performance to invest cache
resources in the application that benefits more from the cache resources rather than in the
application that has more demand for the cache resources.
This chapter proposes Utility-Based Cache Partitioning (UCP), a low-overhead,
runtime mechanism that partitions a shared cache between multiple applications depending
on the reduction in cache misses that each application is likely to obtain for a given
amount of cache resources. The proposed mechanism monitors each application at runtime
using a novel, cost-effective hardware circuit that requires less than 2kB of storage. The
information collected by the monitoring circuits is used by a partitioning algorithm to decide
the amount of cache resources allocated to each application. Our evaluation, with 20
multiprogrammed workloads, shows that UCP improves performance of a dual-core system
by up to 23% and on average 11% over LRU-based cache partitioning.
6.1 Introduction
Modern processors contain multiple cores, which enables them to concurrently execute
multiple applications (or threads) on a single chip. As the number of cores on a chip
increases, the pressure on the memory system to sustain the memory requirements of all
the concurrently executing applications (or threads) increases. One of the keys to obtaining
high performance from multicore architectures is to manage the largest level on-chip cache
efficiently so that off-chip accesses are reduced. This chapter investigates the problem of
partitioning the shared largest-level on-chip cache among multiple competing applications.
Traditional design for on-chip cache uses the LRU (or an approximation of LRU)
policy for replacement decisions. The LRU policy implicitly partitions a shared cache
among the competing applications on a demand1 basis, giving more cache resources to
the application that has a high demand and fewer cache resources to the application that
has a low demand. However, the benefit (in terms of reduction in misses or improvement
in performance) that an application gets from cache resources may not directly correlate
with its demand for cache resources. For example, a streaming application can access a
large number of unique cache blocks but these blocks are unlikely to be reused again if the
working set of the application is greater than the cache size. Although such an application
has a high demand, devoting a large amount of cache will not improve its performance.
Thus, it makes sense to partition the cache based on how much the application is likely to
benefit from the cache rather than the application’s demand for the cache.
1 Demand is determined by the number of unique cache blocks accessed in a given interval [19]. Consider
two applications A and B sharing a fully-associative cache containing N blocks. Then with LRU
replacement, the number of cache blocks that each application receives is decided by the number of unique blocks
accessed by each application in the last N unique accesses to the cache. If $U_A$ is the number of unique blocks
accessed by application A in the last N unique accesses to the cache, then application A will receive $U_A$ cache
blocks out of the N blocks in the cache.
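The notion of demand in footnote 1 can be illustrated with a small sketch. It assumes a fully-associative LRU cache and a trace of (application, block) pairs; the function name `lru_demand_shares` and the trace format are hypothetical and chosen only for illustration.

```python
from collections import Counter

def lru_demand_shares(trace, n_blocks):
    """Count how many of the N cache blocks each application holds under LRU.

    trace    -- list of (app, block) accesses, oldest first (assumed format)
    n_blocks -- capacity N of the fully-associative cache, in blocks

    A fully-associative LRU cache holds the last N unique blocks accessed,
    so we walk the trace backwards until N unique blocks have been seen.
    """
    seen = set()
    shares = Counter()
    for app, block in reversed(trace):
        if len(seen) == n_blocks:
            break
        if (app, block) not in seen:
            seen.add((app, block))
            shares[app] += 1
    return dict(shares)

# Illustrative trace: application A touches more unique blocks than B.
trace = [("A", 1), ("B", 1), ("A", 2), ("A", 1), ("B", 2), ("A", 3)]
print(lru_demand_shares(trace, 4))
```

In this made-up trace, A receives three of the four blocks simply because it touched more unique blocks recently, regardless of whether those blocks will be reused.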
Figure 6.1: A case for utility based cache partitioning: (a) MPKI and (b) CPI as cache size
is varied when vpr and equake are executed separately. The horizontal axis shows the number of
ways allocated from a 16-way 1MB cache (remaining ways are turned off).
We explain the problem with LRU-based partitioning with a numerical example.
Figure 6.1(a) shows the number of misses for two SPEC benchmarks, vpr and equake,
as the cache size is varied, when each one is run separately. We vary the cache size by
changing the number of ways and keeping the number of sets constant. The baseline L2
cache in our experiments is 16-way, 1MB in size and contains 1024 sets (other parameters
of the experiment are described in Section 6.4). For vpr, the number of misses reduces
monotonically as the cache size is increased from 1 way to 16 ways. For equake, the
number of misses decreases as the number of allocated ways increases from 1 to 3, but
increasing the cache size beyond 3 ways does not decrease misses. Thus, equake has
no benefit or utility for cache resources in excess of three ways.
When vpr and equake are run together on a dual-core system, sharing the baseline
1MB 16-way cache, the LRU policy allocates, on average, 7 ways to equake and 9 ways
to vpr. If cache partitioning were based on utility (UTIL) of cache resources, then equake
would get only 3 ways and vpr would get the remaining 13 ways. Decreasing the cache
resources devoted to equake from 7 ways to 3 ways does not increase its misses, but
increasing the cache resources devoted to vpr from 9 ways to 13 ways reduces its misses. As
shown in Figure 6.1(b), partitioning the cache based on utility information can potentially
reduce the CPI of vpr from 2 to 1.5 without affecting the CPI of equake, improving the
overall performance of the dual-core system.
To partition the cache based on an application’s utility for the cache resource, we
propose Utility-Based Cache Partitioning (UCP). An important component of UCP is the
monitoring circuits that can obtain the information about utility of cache resource for all the
competing applications at runtime. For the UCP scheme to be practical, it is important that
the utility monitoring (UMON) circuits are not hardware-intensive or power-hungry.
Section 6.3 describes a novel, low-overhead UMON circuit that requires a storage overhead of
only 1920B (less than 0.2% area of the baseline 1MB cache). The information collected by
UMON is used by a partitioning algorithm to decide the amount of cache allocated to each
competing application. Our evaluation in Section 6.5 shows that UCP outperforms LRU,
improving the performance of a dual-core system by up to 23%, and on average 11%.
The number of possible partitions increases exponentially with the number of
applications sharing the cache. It becomes impractical for the partitioning algorithm to find the
best partition by evaluating every possible partition when a large number of applications
share a highly associative cache. In Section 6.6, we propose the Lookahead Algorithm as a
scalable alternative to evaluating all the possible partitions for partitioning decisions.
6.2 Motivation and Background
Caches improve performance by reducing the number of main memory accesses.
Thus, the utility of cache resources for an application can be directly correlated to the
change in the number of misses or improvement in performance of the application when
the cache size is varied. Figure 6.2, Figure 6.3, and Figure 6.4 show the misses and CPI
for some of the SPEC benchmarks as a function of cache size. The utility of cache resource
varies widely across applications. The applications are classified into three categories based
on how much each of them benefits as the cache size is increased from 1 way to 16 ways
(keeping the number of sets constant).
Figure 6.2: MPKI and CPI for Low Utility Benchmarks. The horizontal axis shows the number
of ways allocated from a 16-way 1MB cache (the remaining ways are turned off).
Figure 6.2 contains benchmarks that do not benefit significantly as the cache size
is increased from 1 way to 16 ways. We say such applications have low utility. These
benchmarks either have a large fraction of compulsory misses (e.g. gap) or have a data set
larger than the cache size2 (e.g. mcf).
Figure 6.3: MPKI and CPI for High Utility Benchmarks. The horizontal axis shows the number
of ways allocated from a 16-way 1MB cache (the remaining ways are turned off).
Benchmarks shown in Figure 6.3 continue to benefit as the cache size is increased
from 1 way to 16 ways. We say such applications have high utility. Benchmarks in
Figure 6.4 benefit significantly as the cache size is increased from 1 way to 8 ways. These
benchmarks have a small working set that fits in a small cache; therefore, giving them
more than 8 ways does not significantly improve their performance. We say such
applications have saturating utility.
2 Applications with low utility can show a large reduction in misses when the cache size is increased such
that the dataset fits in the cache. For example, Figure 6.14 shows that the MPKI of art does not decrease when
the cache size is increased from 1 way to 8 ways (0.5MB). However, increasing the size to 24 ways (1.5MB)
reduces MPKI by a factor of 5. In such cases, the curve of MPKI vs. cache size resembles a step function.
Figure 6.4: MPKI and CPI for Saturating Utility Benchmarks. The horizontal axis shows the
number of ways allocated from a 16-way 1MB cache (the remaining ways are turned off).
If two applications having low utility (e.g. mcf and applu) are executed together,
then their performance is not sensitive to the amount of cache available to each
application. Similarly, when two applications of saturating utility are executed together, the
cache can support the working set of both applications. However, when an application with
saturating utility is run with an application with low utility, the cache may not hold
the working set of the application with saturating utility. Similarly, when an application
with high utility is run with any other application, its performance is highly sensitive to
the amount of cache available to it. In such cases, it is important to partition the cache
judiciously by taking utility information into account.
Figure 6.2, Figure 6.3 and Figure 6.4 show that in most cases,3 reduction in misses
correlates with reduction in CPI. Thus, we can use the information about reduction in
misses to make cache partitioning decisions. To include utility information in partitioning
decisions, we provide a quantitative definition of utility for cache resources for a given
application. Since cache is allocated only on a way basis in our studies, we define utility on
a way granularity. If $\mathrm{miss}_a$ and $\mathrm{miss}_b$ are the number of misses that an application incurs
when it receives $a$ and $b$ ways respectively ($a < b$), then the utility ($U_a^b$) of increasing the
number of ways from $a$ to $b$ is:
$$U_a^b = \mathrm{miss}_a - \mathrm{miss}_b \tag{6.1}$$
Section 6.3 describes cost-effective monitoring circuits that can estimate the utility
(U) information for an application at run-time, along with the framework, the partitioning
algorithm, and the replacement scheme for UCP.
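As a minimal sketch of the utility definition in Equation 6.1, the following helper computes the utility from a per-way miss curve. The function name and the miss counts are hypothetical; the curve is made up to resemble a benchmark whose utility saturates at three ways.

```python
def utility(misses, a, b):
    """Utility of growing an allocation from a ways to b ways (a < b).

    misses -- mapping from number of ways to the miss count observed
              with that many ways (assumed to be known for a and b)
    """
    if not a < b:
        raise ValueError("utility is defined for a < b")
    return misses[a] - misses[b]

# Hypothetical miss curve that stops improving after three ways.
miss_curve = {1: 100, 2: 60, 3: 40, 4: 40}
print(utility(miss_curve, 1, 3))  # misses saved by going from 1 to 3 ways
print(utility(miss_curve, 3, 4))  # no further benefit from the fourth way
```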
6.3 Utility-Based Cache Partitioning
6.3.1 Framework
Figure 6.5 shows the framework to support UCP between two applications that
execute together on a dual-core system. One of the two applications executes on CORE1 and
the other on CORE2. Each core is assigned a utility monitoring (UMON) circuit that tracks
the utility information of the application executing on it. The UMON circuit is separated
from the shared cache, which allows the UMON circuit to obtain utility information about
an application for all the ways in the cache, independent of the contention from the
application executing on the other core. The partitioning algorithm uses the information collected
by the UMON to decide the number of ways to allocate to each core. The replacement
engine of the shared cache enforces the resulting partition (Section 6.3.6).
3 When eight ways are allocated to swim, it sees a huge reduction in misses. However, this reduction in
misses does not translate into a substantial reduction in CPI. This happens because a set of accesses with
high memory-level parallelism (MLP) now fits in the cache, which reduces the average MLP and increases
the average MLP-based cost of each miss.
Figure 6.5: Framework for Utility-Based Cache Partitioning. Newly added structures are shaded.
6.3.2 Utility Monitors (UMON)
Monitoring the utility information of an application requires a mechanism that
tracks the number of misses for all possible numbers of ways. To compute the utility
information for the baseline 16-way cache, the monitoring circuit is required to track misses for
all the sixteen cases, ranging from when only 1 way is allocated to the application to when
all 16 ways are allocated to the application. A straightforward, but expensive, method to
obtain this information is to have sixteen tag directories, each having the same number of
sets as the shared cache, but each having a different number of ways ranging from 1 way to
16 ways (note that data lines are not required to estimate hit-miss information). Although
this scheme can track utility information for any replacement scheme implemented in the
shared cache, the hardware overhead of multiple directories makes this scheme impractical.
Fortunately, the baseline LRU policy obeys the stack property [46], which means that an
access that hits in an LRU-managed cache containing N ways is guaranteed to also hit if the
cache had more than N ways (the number of sets being constant). This means that even with
a single tag directory containing sixteen ways, it is possible to compute the hit-miss
information about all the cases when the cache contains from one way through sixteen ways.
To see how the stack algorithm provides utility information, consider the example of a
four-way set-associative cache shown in Figure 6.6(a).
Figure 6.6: Tracking utility information using stack property: (a) Hit counters for each
recency position (b) Obtaining utility information from hit counters using stack property.
Each set has four counters for obtaining the hit counts for each of the four recency
positions ranging from MRU to LRU. The position next to MRU in the recency order is
referred to as position 1 and the next position as position 2. If a cache access results in a hit,
the counter corresponding to the hit-causing recency position is incremented. The counters
then represent the number of misses saved by each recency position. Figure 6.6(b) shows
an example in which out of the 100 accesses to the cache, 25 miss, 30 hit in MRU, 20 hit
in position 1, 15 hit in position 2, and the remaining 10 hit in the LRU position. Then, if
the cache size is reduced from four ways to three ways, the misses increase from 25 to 35.
Further reducing the cache size to two ways increases the number of misses to 50. And
with only one way the cache incurs 70 misses. Thus, given information about misses in a
cache that has a large number of ways, it is possible to obtain the information about misses
for a cache with a smaller number of ways.
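The arithmetic of this example can be reproduced with a short sketch. It assumes the hit counters are stored MRU first and that the miss count of the full-size cache is known; the function name is hypothetical.

```python
def misses_per_way_count(hit_counters, misses_at_full_size):
    """Derive the miss count for every smaller cache from one set of counters.

    hit_counters        -- hits per recency position, MRU first, LRU last
    misses_at_full_size -- misses observed with all ways present

    By the stack property, a cache with w ways keeps only the hits of the
    first w recency positions; hits in deeper positions become misses.
    Returns a list whose element w-1 is the miss count with w ways.
    """
    result = []
    for ways in range(1, len(hit_counters) + 1):
        lost_hits = sum(hit_counters[ways:])
        result.append(misses_at_full_size + lost_hits)
    return result

# Counters from the example above: 30 MRU hits, then 20, 15, and 10 LRU hits.
print(misses_per_way_count([30, 20, 15, 10], 25))  # [70, 50, 35, 25]
```

The output lists the misses for one, two, three, and four ways, matching the values 70, 50, 35, and 25 quoted in the text.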
The UMON circuit tracks the utility of each way using an Auxiliary Tag Directory
(ATD) and hit counters. The ATD has the same associativity as the main tag directory of
the shared cache and uses the LRU policy for replacement decisions. Figure 6.7(a) shows
a UMON that contains the hit counters for each set in the cache. We call this organization
UMON-local. Although UMON-local can perform partitioning on a per-set basis, it
requires a huge overhead because of the extra tag entry and hit counter for each line in
the cache. The hardware overhead of the hit counters in UMON-local can be reduced by
having one group of hit counters for all the sets in the cache. This configuration, shown
in Figure 6.7(b), is called UMON-global. UMON-global enforces a uniform partition for
all the sets in the cache. Compared to UMON-local, UMON-global reduces the number
of hit counters required to implement UMON by a factor equal to the number of sets in the cache.
However, the number of tag entries required to implement UMON still remains equal to
the number of lines in the cache.
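A software model of UMON-global may help make the organization concrete. The sketch below is only a behavioral approximation under simplifying assumptions: each ATD set is a Python list ordered MRU first, the set index is taken as the block address modulo the number of sets, and the class name `UmonGlobal` is hypothetical. It is not a description of the hardware implementation.

```python
class UmonGlobal:
    """Behavioral model: one LRU-managed ATD plus one shared group of hit counters."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.atd = [[] for _ in range(num_sets)]  # tags per set, MRU first
        self.hits = [0] * num_ways                # one counter per recency position
        self.misses = 0

    def access(self, block_addr):
        """Record one access by the monitored application."""
        set_index = block_addr % self.num_sets   # assumed indexing function
        tag = block_addr // self.num_sets
        stack = self.atd[set_index]
        if tag in stack:
            position = stack.index(tag)          # 0 = MRU, num_ways - 1 = LRU
            self.hits[position] += 1
            stack.pop(position)
        else:
            self.misses += 1
            if len(stack) == self.num_ways:
                stack.pop()                      # evict the LRU tag
        stack.insert(0, tag)                     # accessed tag becomes MRU

umon = UmonGlobal(num_sets=1, num_ways=4)
for addr in [1, 2, 1, 3, 2]:
    umon.access(addr)
print(umon.misses, umon.hits)
```

Because the ATD sees only the accesses of its own application, the counters reflect what that application would obtain with all the ways to itself.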
6.3.3 Reducing Storage Overhead Using DSS
The number of UMON circuits in the system is equal to the number of cores. For
the UMON circuit to be practical, it is important that it requires low hardware overhead.
UMON-global requires an extra tag entry for each line in the cache. If each tag entry is
4 bytes then the UMON overhead per cache line is 8 bytes for a two-core system and 16
bytes for a four-core system. Considering that the baseline cache line is 64 bytes in size, the
overhead of UMON-global is still substantial. To reduce the overhead of UMON, we use
the Dynamic Set Sampling (DSS) concept proposed in Chapter 3. The key idea behind DSS
is that the behavior of the cache can be approximated by sampling only a few sets. We
use DSS to approximate the hit counter information of UMON-global by sampling few sets
in the cache. Figure 6.7(c) shows the UMON circuit with Dynamic Set Sampling
(UMON-DSS). The ATD in UMON-DSS contains ATD entries only for two sets A and C instead
of all the four sets in the cache. An important question is how many sampled sets are
required for UMON-DSS to approximate the performance of UMON-global. We derive
analytical bounds4 for UMON-DSS in the next section and in Section 6.5.5, we compare
the performance of UMON-DSS with UMON-global.
Figure 6.7: Utility Monitors: (a) UMON-local (b) UMON-global (c) UMON implemented
with Dynamic Set Sampling (DSS). Legend: tag entry in the Auxiliary Tag Directory (ATD);
hit counter for a recency position; association of recency position to counter.
4 DSS was used in Chapter 3 to choose between two replacement policies. Thus, it was used to
approximate a global decision which had a binary value (one of the two replacement policies) by using the binary
decisions obtained on the sampled sets. We are interested in approximating the global partitioning decision
which is a discrete value (how many ways to allocate) by using the hit counter information of the sampled
sets. Therefore the bounds derived in Chapter 3 are not applicable to the proposed mechanism.
6.3.4 Analytical Model for Dynamic Set Sampling
Let there be two applications $A$ and $B$ competing for a cache containing $S$ sets. Let
$a(i)$ denote the number of ways that application $A$ receives for a given set $i$, if the
partitioning is done on a per-set basis. If $a(i)$ does not vary across sets, then even with a single
set UMON-DSS can approximate UMON-global. However, $a(i)$ may vary across sets. The
number of ways allocated to application $A$ by UMON-global ($u_g$) can be approximated as
the average of $a(i)$ over all the sets:
$$u_g = \frac{1}{S} \sum_{i=1}^{S} a(i) \tag{6.2}$$
Let $n$ be the number of randomly selected sets sampled by UMON-DSS. Let $u_s$ be the
number of ways allocated to application $A$ by UMON-DSS. We are interested in bounding
the value of $|u_s - u_g|$ to some threshold $\epsilon$. If $\sigma^2$ is the variance in the values of $a(i)$ across
all the sets, then by Chebyshev’s inequality [68]:
$$P(|u_s - u_g| \geq \epsilon) \leq \frac{\sigma^2}{n \cdot \epsilon^2} \tag{6.3}$$
For bounding $u_s$ to within one way of $u_g$, $\epsilon = 1$:
$$P(u_s \text{ is at least one way from } u_g) \leq \frac{\sigma^2}{n} \tag{6.4}$$
$$P(u_s \text{ is within one way of } u_g) \geq 1 - \frac{\sigma^2}{n} \tag{6.5}$$
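The bound of Equation 6.5 is straightforward to evaluate numerically. The sketch below uses a hypothetical function name and clamps the bound at zero, since a negative lower bound on a probability carries no information; the variance values passed in are illustrative inputs, not measured data.

```python
def chebyshev_lower_bound(variance, num_sampled_sets, epsilon=1.0):
    """Lower bound on P(|u_s - u_g| < epsilon) given by Chebyshev's inequality.

    variance         -- variance of a(i) across all the sets
    num_sampled_sets -- number n of randomly sampled sets
    epsilon          -- tolerated error in ways (1 way by default)
    """
    bound = 1.0 - variance / (num_sampled_sets * epsilon ** 2)
    return max(0.0, bound)  # a probability bound below zero is vacuous

# Illustrative inputs: how the bound tightens as more sets are sampled.
for n in (8, 16, 32):
    print(n, chebyshev_lower_bound(variance=3.0, num_sampled_sets=n))
```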
As Chebyshev’s inequality considers only variance without making any assumption
about the distribution of the data, the bounds obtained from Chebyshev’s inequality are
pessimistic5 [68]. Figure 6.8 shows the lower bound provided by Chebyshev’s inequality
as the number of sampled sets is varied, for different values of variance. For most of
the workloads studied, the value of variance ($\sigma^2$) is less than 3, indicating that even with
the pessimistic bounds, as few as 32 sets are sufficient for UMON-DSS to approximate
UMON-global. We compare UMON-DSS to UMON-global in Section 6.5.5. Unless stated
otherwise, we use 32 sets for UMON-DSS. The sampled sets for UMON-DSS are chosen
using the simple static policy [63], which means set 0 and every 33rd set is selected. For
the remainder of the chapter UMON by default means UMON-DSS.
5 In general, much tighter bounds can be obtained if the mean and the distribution of the sampled data are
known [68].
Figure 6.8: Bounds on Number of Sampled Sets
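The simple static selection described above can be sketched in one line. The function name is hypothetical, and the defaults assume the 1024-set baseline cache with a stride of 33 sets.

```python
def simple_static_sampled_sets(num_sets=1024, stride=33):
    """Return the indices of the sampled sets: set 0 and every 33rd set after it."""
    return list(range(0, num_sets, stride))

sampled = simple_static_sampled_sets()
print(len(sampled), sampled[:4], sampled[-1])
```

With these defaults the selection yields exactly 32 sets, from set 0 through set 1023.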
6.3.5 Partitioning Algorithm
The partitioning algorithm reads the hit counters from the UMON circuits of
each of the competing applications. The partitioning algorithm tries to minimize the total
number of misses incurred by all the applications. The utility information in the hit
counters directly correlates with the reduction in misses for a given application when given a
fixed number of ways. Thus, reducing the most number of misses is equivalent to
maximizing the combined utility. If $A$ and $B$ are two applications with utility functions $U_A$ and
$U_B$ respectively, then for partitioning decisions, the combined utility ($U_{tot}$) of $A$ and $B$ is
computed for all possible partitions:
$$U_{tot} = {U_A}_1^{i} + {U_B}_1^{(16-i)} \quad \text{for } i = 1 \text{ to } (16 - 1) \tag{6.6}$$
The partition that gives the maximum value for $U_{tot}$ is selected. In our studies, we
guarantee that the partitioning algorithm gives at least one way to each application. We
invoke the partitioning algorithm once every five million cycles (a design choice based
on simulation results). After each partitioning interval, the hit counters in all UMONs are
halved. This allows the UMON to retain past information while giving importance to recent
information.
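The exhaustive two-application search can be sketched as follows. The sketch assumes each UMON exposes its hit counters as a list ordered MRU first; the function names and the counter values in the usage example are hypothetical. Maximizing the total hits retained by both applications selects the same partition as maximizing the combined utility, because the two quantities differ only by the MRU hit counts, which every partition retains.

```python
def best_partition(hits_a, hits_b, total_ways=16):
    """Try every split that gives each application at least one way.

    hits_a, hits_b -- UMON hit counters per recency position, MRU first
    Returns (ways_a, ways_b) that retains the most hits overall.
    """
    best_ways_a, best_hits = None, -1
    for ways_a in range(1, total_ways):
        ways_b = total_ways - ways_a
        combined = sum(hits_a[:ways_a]) + sum(hits_b[:ways_b])
        if combined > best_hits:
            best_ways_a, best_hits = ways_a, combined
    return best_ways_a, total_ways - best_ways_a

def halve_counters(hits):
    """Age the counters after each partitioning interval (a right shift in hardware)."""
    return [h // 2 for h in hits]

# Four-way illustration: A keeps benefiting from extra ways, B saturates quickly.
print(best_partition([30, 20, 15, 10], [50, 1, 0, 0], total_ways=4))
print(halve_counters([30, 21, 15, 10]))
```

Ties are resolved here in favor of the first partition found, which is an arbitrary choice of this sketch.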
6.3.6 Changes to Replacement Policy
To incorporate the decisions made by the partitioning algorithm, the baseline LRU
policy is augmented to enable way partitioning [13][78][28]. To implement way
partitioning, we add a bit to the tag-store entry of each block to identify the core which installed the
block in the cache. On a cache miss, the replacement engine counts the number of cache
blocks that belong to the miss-causing application in the set. If this number is less than the
number of blocks allocated to the application, then the LRU block among all the blocks
that do not belong to the application is evicted. Otherwise, the LRU block among all the
blocks of the miss-causing application is evicted.
If the number of ways allocated to an application is increased by the partitioning
algorithm, then these added ways are consumed by the application only on cache misses.
This gradual change of partitions allows the cache to retain the cache blocks until they are
required by the application that is allocated the cache space.
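The victim selection rule can be sketched as below. The sketch assumes the set is full, represents a set as a list of owner core identifiers ordered from MRU to LRU, and uses a hypothetical function name; it models only the choice of victim, not the tag-store update.

```python
def choose_victim(set_owners, miss_core, allocation):
    """Pick the index of the block to evict on a miss under way partitioning.

    set_owners -- owner core of each block in the set, MRU first, LRU last
    miss_core  -- core whose access caused the miss
    allocation -- mapping from core to the number of ways allocated to it
    """
    owned = sum(1 for owner in set_owners if owner == miss_core)
    if owned < allocation[miss_core]:
        # Under its quota: take a way from the other application(s).
        candidates = [i for i, o in enumerate(set_owners) if o != miss_core]
    else:
        # At or over its quota: replace one of its own blocks.
        candidates = [i for i, o in enumerate(set_owners) if o == miss_core]
    return candidates[-1]  # last candidate in recency order is the LRU one

# Core 0 holds one block but is allocated two ways, so it evicts core 1's LRU block.
print(choose_victim([0, 1, 1, 1], miss_core=0, allocation={0: 2, 1: 2}))
```

Because ways change hands only when the under-quota application misses, a newly enlarged partition fills gradually rather than by flushing blocks.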
6.4 Experimental Methodology
6.4.1 Multicore System Configuration
Table 6.1 shows the parameters of the baseline configuration used in our
experiments. We use an in-house simulator that models the Alpha ISA. The processor core is
8-wide issue, out-of-order, with a 128-entry reservation station. The first-level instruction
cache and data cache are private to the processor core. The processor parameters are kept
constant in our study. This allows us to use a fast event-driven processor model to reduce
simulation time. Because our study deals with the memory system, we model the memory
system in detail. DRAM bank conflicts and bus queuing delays are modeled. The baseline
L2 cache is shared among all the processor cores and uses LRU replacement. Thus, the L2
cache gets partitioned among all the competing cores on a demand basis.
Table 6.1: Multicore System Configuration.
Processor core    8-wide, out-of-order, with 128-entry reservation station;
                  64 kB hybrid branch predictor with 4k-entry BTB;
                  minimum branch misprediction penalty of 15 cycles.
L1                Icache and Dcache: 16kB, 64B line-size, 4-way, LRU.
                  The L1 caches are private to each core.
Unified Shared    1MB, 64B line-size, 16-way with LRU replacement,
L2 Cache          15-cycle hit, 32-entry MSHR, 128-entry store buffer.
                  L2 cache is shared among all the cores.
Memory            32 DRAM banks; 400-cycle access latency;
                  bank conflicts modeled; maximum 32 outstanding requests.
Bus               16B-wide split-transaction bus at 4:1 frequency ratio;
                  queueing delays modeled.
6.4.2 Multicore Performance Metrics
There are several metrics to quantify the performance of a system in which multiple
applications execute concurrently. We discuss the three metrics commonly used in the
literature: weighted speedup, sum of IPCs, and harmonic mean of normalized IPCs. Let
$IPC_i$ be the IPC of the $i$th application when it concurrently executes with other applications
and $SingleIPC_i$ be the IPC of the same application when it executes in isolation. Then,
for a system in which $N$ applications execute concurrently, the three metrics are given by:
$$\text{Weighted Speedup} = \sum_i (IPC_i / SingleIPC_i) \tag{6.7}$$
$$IPC_{sum} = \sum_i IPC_i \tag{6.8}$$
$$IPC_{norm\_hmean} = N / \sum_i (SingleIPC_i / IPC_i) \tag{6.9}$$
The Weighted Speedup metric indicates reduction in execution time. The $IPC_{sum}$
metric indicates the throughput of the system but it can be unfair to a low-IPC
application. The $IPC_{norm\_hmean}$ metric balances both fairness and performance [45]. We will use
Weighted Speedup as the metric for quantifying the performance of multicore
configurations throughout the chapter. Evaluation with the $IPC_{sum}$ and $IPC_{norm\_hmean}$ metrics will
also be discussed for some of the key results in the chapter.
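The three metrics can be computed with a few lines of code. The sketch assumes the usual definitions: weighted speedup as the sum of each application's IPC normalized to its isolated IPC, the IPC sum as the plain sum, and the harmonic mean of normalized IPCs as given in Equation 6.9. Function names and the IPC values in the usage example are hypothetical.

```python
def weighted_speedup(ipc, single_ipc):
    """Sum of per-application IPCs, each normalized to its isolated IPC."""
    return sum(i / s for i, s in zip(ipc, single_ipc))

def ipc_sum(ipc):
    """Raw throughput: the sum of the IPCs of all applications."""
    return sum(ipc)

def ipc_norm_hmean(ipc, single_ipc):
    """Harmonic mean of the normalized IPCs."""
    return len(ipc) / sum(s / i for i, s in zip(ipc, single_ipc))

# Made-up dual-core example: the first application runs at half its isolated IPC.
shared = [0.5, 1.0]
alone = [1.0, 1.0]
print(weighted_speedup(shared, alone), ipc_sum(shared), ipc_norm_hmean(shared, alone))
```

For a dual-core system, the weighted speedup ranges up to an ideal value of 2, reached when neither application slows down relative to running alone.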
6.4.3 Multi-programmed Workloads
We use benchmarks from the SPEC CPU2000 suite for our studies. A
representative slice of 250M instructions is obtained for each benchmark using a tool that we
developed using the SimPoint methodology [58]. Two separate benchmarks are combined to
form one multiprogrammed workload that can be run on a dual-core system. To include a
wide variety of multiprogrammed workloads in our study, we classify the multiprogrammed
workloads into five categories. Workloads with Weighted Speedup for the baseline
configuration between 1 and 1.2 are classified as Type A, between 1.2 and 1.4 as Type B, between
1.4 and 1.6 as Type C, between 1.6 and 1.8 as Type D, and between 1.8 and 2 as Type E.
A suite containing 20 workloads is created by using four workloads from each of the five
categories.
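The classification rule can be written as a small lookup. The text does not say which category a workload on an exact boundary (for example 1.2) belongs to, so the sketch assumes half-open intervals with the upper end of Type E included; the function name is hypothetical.

```python
def workload_type(base_ws):
    """Map a baseline weighted speedup in [1, 2] to its workload category.

    Boundary handling is an assumption: each interval includes its lower
    bound and excludes its upper bound, except Type E, which includes 2.
    """
    if not 1.0 <= base_ws <= 2.0:
        raise ValueError("baseline weighted speedup expected in [1, 2]")
    for upper, label in ((1.2, "A"), (1.4, "B"), (1.6, "C"), (1.8, "D")):
        if base_ws < upper:
            return label
    return "E"

print(workload_type(1.04), workload_type(1.49))
```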
Simulation for a dual-core system is continued until both benchmarks in the
multiprogrammed workload execute at least 250M instructions each. If a benchmark finishes
the stipulated 250M instructions before the other benchmark finishes 250M instructions, it
is restarted so that the two benchmarks continue to compete for the L2 cache throughout
the simulation. Table 6.2 shows the classification based on baseline weighted speedup
(BaseWS), Misses Per 1000 Instructions (MPKI) and Cycles Per Instruction (CPI) for the
baseline dual-core configuration for all the 20 workloads. The benchmark names for ammp
(amp), swim (swm), perlbmk (perl), and wupwise (wup) are abbreviated.
Table 6.2: Multi-programmed Workload Summary
Category   Workload      MPKI    MPKI    CPI    CPI
(BaseWS)   Bmk1-Bmk2     Bmk1    Bmk2    Bmk1   Bmk2
TYPE A     galgel-vpr    11.84   8.41    1.25   2.55
(1.0-1.2)  galgel-twolf  11.46   11.44   1.20   3.51
           amp-galgel    6.91    10.62   1.74   1.21
           apsi-galgel   3.08    10.82   1.14   1.19
TYPE B     twolf-vpr     8.76    6.22    2.81   2.06
(1.2-1.4)  apsi-twolf    2.05    7.51    0.93   2.61
           amp-art       6.73    43.73   1.72   4.90
           apsi-art      2.91    43.12   1.12   4.76
TYPE C     apsi-swm      2.71    22.98   1.05   2.89
(1.4-1.6)  amp-applu     6.71    13.76   1.69   1.28
           swm-twolf     22.98   10.64   2.73   3.26
           art-parser    42.75   3.48    4.52   1.33
TYPE D     equake-vpr    18.33   5.74    4.57   1.97
(1.6-1.8)  vpr-wup       5.40    2.25    1.89   0.72
           gzip-twolf    1.61    5.36    0.84   2.17
           art-crafty    41.10   0.63    4.33   0.96
TYPE E     fma3d-swm     4.62    23.53   0.51   2.94
(1.8-2.0)  mcf-applu     134     13.76   28.5   1.27
           gap-mesa      1.66    0.62    0.41   0.35
           crafty-perl   0.14    0.04    0.81   0.44
6.5 Results and Analysis
6.5.1 Performance on Weighted Speedup Metric
We compare the performance of UCP to two partitioning schemes: LRU and
Half-and-Half. The Half-and-Half scheme statically partitions the cache equally among the
two competing applications. The disadvantage of the Half-and-Half scheme over LRU
is that it cannot change the partition in response to the varying demands of competing
applications. However, it also has the advantage of performance isolation, which means
that the performance of an application does not degrade substantially when it executes
concurrently with a badly behaving application. Figure 6.9 shows the weighted speedup of
the three partitioning schemes. The bar labeled gmean represents the geometric mean of
the weighted speedups of all the 20 workloads.
Figure 6.9: Performance of LRU, Half-and-Half, and UCP.
LRU performs better than Half-and-Half for some workloads, and for others
Half-and-Half performs better than LRU. For most workloads, UCP outperforms the best-performing
scheme out of the other two schemes. On average, UCP improves performance by 10.96%
over the baseline LRU policy, increasing the geometric mean weighted speedup from 1.46
to 1.62.
The Type A category contains workloads where both benchmarks in the workload
have high utility and high demand for the L2 cache. Therefore, the baseline LRU policy
has a value of weighted speedup that is almost half of the ideal value of 2. Partitioning the
cache based on utility, rather than demand, improves performance noticeably. For example,
UCP increases the weighted speedup for the workload galgel-twolf from 1.04 to 1.28.
The Type C category contains workloads where one benchmark has high utility and
the other has low utility. In such cases, UCP allocates most of the cache resource to the
application with high utility, thus improving the overall performance. For example, for
amp-applu, UCP allocates 14 or more ways out of the 16 ways to amp, improving the
weighted speedup from 1.49 to 1.83.
When both benchmarks in the workload have low utility (e.g. mcf-applu), the
performance of each benchmark in the workload is not sensitive to the amount of cache
available to it, so the weighted speedup is close to ideal. Similarly, if the cache can ac-
commodate the working set of both benchmarks in the workload, the weighted speedup
for that workload is close to ideal. Such workloads are included in the Type E category.
As the weighted speedup of these workloads is close to the ideal, UCP does not change
performance significantly.
For twolf-vpr, crafty-perl, and gzip-twolf, UCP reduces performance marginally
compared to LRU. This happens because UCP allocates partitions once every partition
interval (5M cycles in our experiments), so it is unable to respond to the phase changes that
occur at a finer granularity. On the other hand, LRU can respond to such fine-grained
changes in behavior of the applications by changing the partitions potentially at every
access. The LRU policy also has the advantage of doing the partitioning on a per-set basis
depending on the demand on the individual set. In contrast, the proposed UCP
policy globally allocates a uniform partition for all the sets in the cache, sacrificing
fine-grained, per-set control for reduced overhead.
6.5.2 Performance on Throughput Metric
Figure 6.10 compares the performance of the baseline LRU policy to the proposed
UCP policy for the throughput metric, $IPC_{sum}$. To show the change in the IPC of the
individual benchmarks of each workload, the graph is drawn as a stacked bar graph. The IPC of
the first benchmark that appears in the name of the workload is labeled as IPC-Benchmark1.
The IPC of the other benchmark is labeled as IPC-Benchmark2. For example, for the
workload galgel-vpr, IPC-Benchmark1 shows the IPC of galgel and IPC-Benchmark2 shows the
IPC of vpr. The bar labeled hmean represents the harmonic mean of the $IPC_{sum}$ of all the
20 workloads.
For 15 out of the 20 workloads, UCP improves the $IPC_{sum}$ compared to the LRU
policy. UCP can improve performance by improving the IPC of one benchmark in the
workload without affecting the IPC of the other benchmark in the workload. Examples of
such workloads are apsi-swm and equake-vpr. UCP can also improve the aggregate IPC
by marginally reducing the IPC of one benchmark and significantly improving the IPC of
the other benchmark. Examples include apsi-galgel and amp-art. For the $IPC_{sum}$ metric,
UCP reduces performance on two workloads, gzip-twolf and crafty-perl. On average, UCP
improves the performance on the throughput metric by 16.8%, increasing the harmonic [...]
Figure 6.10: LRU (left bar) vs. UCP (right bar) on throughput metric.
6.5.3 Evaluation on Fairness Metric
A dynamic partitioning mechanism may improve the overall performance of the
system at the expense of severely degrading the performance of one of the applications. The
harmonic mean of the normalized IPCs is shown to consider both fairness and performance
[45]. Figure 6.11 shows the performance of LRU, Half-and-Half, and UCP for this metric.
The bar labeled gmean is the geometric mean over all the 20 workloads. UCP improves
the average on this metric by 11%, increasing the gmean from 0.71 to 0.79. Note that more [...]
Figure 6.11: LRU, Half-and-Half, and UCP on fairness metric.
6.5.4 Phase-Based Adaptation of UCP
The utility of cache resources for an application can vary over time. The dynamic
partitioning of UCP allows it to adapt to the time-varying phase behavior of the competing
applications. The variation in utility for cache resources of an application may not correlate
with its variation in demand for cache resources. We analyze the time-varying phase
behavior of the workload swim-twolf by comparing UCP and LRU to a partitioning scheme that
statically allocates a fixed number of ways to each competing application. Figure 6.12(a)
shows the MPKI of swim for the static partitioning scheme as the number of ways devoted
to swim is varied.
With static partitioning, devoting less than 9 ways to swim increases its MPKI con-
siderably. When swim and twolf are executed together, the bas line LRU policy allocates,
113
0 2 4 6 8 10 12 14 16







































UCP (AVG) LRU (AVG)
0 10 20 30 40 50 60 70 80











































Figure 6.12: UCP vs. Static Partitioning. (a) Variation in MPKI as the number of ways allocated
to swim is changed statically. (b) The average number of ways dynamically allocated to swim when it is
executed with twolf by the LRU policy and the UCP policy.
on average, 10.5 ways to swim (see footnote 6), whereas the UCP policy allocates, on average, only 3.3
ways to swim. However, the MPKI of swim with the UCP policy (23.7) remains similar
to that with the LRU policy (22.98). This happens because UCP allocates ways to swim only
in phases when the allocated ways are likely to reduce misses. Figure 6.12(b) shows the
number of ways allocated to swim over time by LRU and UCP. LRU consistently allocates
10 or more ways to swim throughout the simulation. UCP allocates 9 ways to swim only
between 230M and 320M cycles of simulation, and three or fewer ways otherwise. Because
swim receives cache resources in the phase when not having them would increase its MPKI
considerably, the number of misses for swim does not increase compared to the LRU policy.
Reducing the average number of ways of swim from 10.5 to 3.3 allows twolf to have
12.7 ways instead of 5.5 ways. This reduces the MPKI of twolf from 10.64 to 5.18.
Footnote 6: The average number of ways allocated to an application by the LRU policy is measured by sampling the
cache every 2M cycles. The number of lines present in the cache for the given application is counted, and this
number is divided by the number of sets in the cache.
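The measurement described in the footnote can be sketched as follows. This is only an illustration: the function name, the 1024-set geometry, and the sample values are assumptions chosen for the example, not data from the experiments.

```python
def average_ways(line_counts, num_sets):
    """Average ways held by one application under LRU.

    line_counts: number of the application's lines found in the cache at each
    sampling point (one entry per sample, e.g. one every 2M cycles).
    """
    per_sample = [lines / num_sets for lines in line_counts]
    return sum(per_sample) / len(per_sample)

# Hypothetical: a 1024-set cache sampled three times.
avg = average_ways([10240, 11264, 10752], num_sets=1024)
```

Dividing the line count by the number of sets converts an occupancy in lines into an equivalent number of ways per set, which is what makes the LRU curve comparable to the explicit way allocations of UCP.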
6.5.5 Effect of Varying the Number of Sampled Sets
We use 32 sampled sets for each UMON circuit in the default UCP configuration. This
section analyzes the sensitivity of the UCP mechanism to the number of sampled sets in
the UMON. Figure 6.13 compares the performance of four UCP configurations: the first
samples 8 sets, the second samples 16 sets, the third is the default UCP configuration with
32 sampled sets, and the fourth is UMON-global, which samples all the sets.



























































































































Figure 6.13: Effect of Number of Sampled Sets on UCP.
For all workloads, the default UCP configuration with 32 sampled sets performs
similarly to UMON-global (all sets). The performance of the workload galgel-twolf drops
if the number of sampled sets is reduced to 8. For the other workloads, the performance of UCP
is relatively insensitive to the number of sampled sets (for sampled sets ≥ 8). This is consistent
with the lower bounds derived with the analytical model presented in Section 6.3.4.
This result is particularly useful because it means that the default UCP configuration with only
32 sets performs similarly to the UMON-global configuration without requiring the large
hardware overhead associated with the UMON-global configuration. This reduced overhead
makes the UCP scheme practical. The next section quantifies the hardware overhead.
6.5.6 Hardware Overhead of UCP
The major source of hardware overhead of UCP is the UMON circuit. Table 6.3 details
the storage overhead of a UMON containing 32 sampled sets, assuming a 40-bit physical
address space. Each UMON requires 1920 B of storage (less than 0.2% of the
area of the baseline 1MB cache), so for the baseline dual-core configuration
UCP requires less than 0.4% storage overhead for implementing the UMON circuits.
The low overhead of UMON means that the UCP scheme remains cost-effective even as the
number of cores increases (e.g., UMON overhead of less than 1% with four cores). The
storage overhead of UMON can be reduced further by using partial tags in the ATD. In
addition to the storage bits, each UMON also contains an adder for incrementing the hit
counters and a shifter to halve the hit counters after each partitioning interval.
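The role of the adder and the shifter can be illustrated with a toy software model of a UMON. This is a sketch under stated assumptions, not the hardware design: it assumes one hit counter per LRU recency position (the usual stack-distance arrangement), and the class name, method names, and access trace are hypothetical.

```python
class UMON:
    """Toy utility monitor: an auxiliary tag directory (ATD) over sampled sets."""

    def __init__(self, num_sampled_sets=32, ways=16):
        self.ways = ways
        # Each sampled set is a recency-ordered list of tags (index 0 = MRU).
        self.sets = [[] for _ in range(num_sampled_sets)]
        # hits[i] counts hits observed at recency position i (assumption).
        self.hits = [0] * ways

    def access(self, set_index, tag):
        stack = self.sets[set_index]
        if tag in stack:
            pos = stack.index(tag)
            self.hits[pos] += 1      # the adder: increment one hit counter
            stack.pop(pos)
        elif len(stack) == self.ways:
            stack.pop()              # miss in a full set: evict the LRU tag
        stack.insert(0, tag)         # the accessed tag becomes MRU

    def hits_with(self, n_ways):
        """Hits this application would have seen with only n_ways ways."""
        return sum(self.hits[:n_ways])

    def end_interval(self):
        # The shifter: halve every counter after a partitioning interval.
        self.hits = [h >> 1 for h in self.hits]


umon = UMON(num_sampled_sets=1, ways=4)
for tag in ["A", "B", "A", "C", "B", "A"]:
    umon.access(0, tag)
hits_before = list(umon.hits)
umon.end_interval()
```

Halving rather than clearing the counters keeps some memory of past intervals while letting recent behavior dominate the next partitioning decision.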
Table 6.3: Storage Overhead of a UMON circuit with 32 Sets
Size of each ATD entry (1 valid bit + 24-bit tag + 4-bit LRU)    29 bits
Total number of ATD entries per sampled set (1/way * 16)         16
ATD overhead per sampled set (29 bits/way * 16 ways)             58 B
Total ATD overhead (32 sampled sets * 58 B/set)                  1856 B
Overhead of hit counters (16 counters * 4 B each)                64 B
Total UMON overhead (1856 B + 64 B)                              1920 B
Area of baseline L2 cache (64 kB tags + 1 MB data)               1088 kB
% increase in L2 area due to 1 UMON (1920 B/1088 kB)             0.17%
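The arithmetic in Table 6.3 can be reproduced with a few lines of Python. The constant names are illustrative; the values are the ones listed in the table.

```python
VALID_BITS, TAG_BITS, LRU_BITS = 1, 24, 4
WAYS, SAMPLED_SETS = 16, 32
COUNTERS, COUNTER_BYTES = 16, 4

entry_bits = VALID_BITS + TAG_BITS + LRU_BITS        # bits per ATD entry
atd_bytes_per_set = entry_bits * WAYS // 8            # 464 bits per sampled set
atd_bytes = atd_bytes_per_set * SAMPLED_SETS          # whole ATD
umon_bytes = atd_bytes + COUNTERS * COUNTER_BYTES     # ATD plus hit counters
l2_bytes = (64 + 1024) * 1024                         # 64 kB tags + 1 MB data
overhead_pct = 100.0 * umon_bytes / l2_bytes          # one UMON vs. baseline L2
```

Because the ATD term scales with the number of sampled sets rather than with the cache size, the per-UMON cost stays fixed as the cache grows.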
Implementing way-partitioning on a dual-core system requires a bit in each tag-store
entry to identify which of the two cores installed the line in the cache. The partitioning
algorithm contains a comparator circuit and requires negligible storage. Note that none of
the structures or operations required by UCP is on the critical path, resource-intensive,
complex, or power-hungry.
6.6 Scalable Partitioning Algorithm
We assumed that the partitioning algorithm is able to find the partition of maximum
utility by computing the combined utility of all the applications for every possible partition.
This is not a problem when there are only two applications, as an N-way cache can be way-partitioned
among two applications in only N+1 ways. However, the number of possible
partitions increases exponentially with the number of competing applications, making it impractical
to evaluate every possible partition. For example, a 32-way cache can be shared
by four applications in 6,545 ways, and by 8 applications in 15,380,937 ways. Finding
an optimal solution to the partitioning problem has been shown to be NP-hard [65]. In this
section we develop a partitioning algorithm that has a worst-case time complexity of N^2/2.
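The partition counts quoted above are consistent with a standard stars-and-bars count, assuming that an application may receive zero ways. The helper below is a hypothetical illustration of that count, not part of the proposed mechanism.

```python
from math import comb

def num_partitions(ways, apps):
    """Number of ways to split `ways` identical ways among `apps` applications,
    allowing an application to receive zero ways (stars and bars)."""
    return comb(ways + apps - 1, apps - 1)

two_apps = num_partitions(16, 2)     # N + 1 for two applications
four_apps = num_partitions(32, 4)
eight_apps = num_partitions(32, 8)
```

For two applications the formula reduces to N + 1, matching the two-application case discussed in the text.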
6.6.1 Background
Our algorithm is derived from the greedy algorithm proposed in [76]. The greedy
algorithm is shown in Algorithm 2. In each iteration, one block (see footnote 7) is assigned to the application
that has the maximum utility for that block. The iterations continue until all the
blocks are assigned. This algorithm is shown to be optimal if the utility curves for all the
competing applications are convex [76]. However, when the utility curves are non-convex,
the greedy algorithm can exhibit pathological behavior. Figure 6.14 shows two
benchmarks, art and galgel, that have non-convex utility curves. Art shows no reduction in
misses until it is assigned at least 8 blocks, and after that it shows a huge reduction in misses.
As the greedy algorithm considers the gain from only the immediately next block, it will not assign
any blocks to art (unless the utility of that block is zero even for the other application).
To address this shortcoming of the greedy algorithm, Suh et al. [78] propose to also invoke
Footnote 7: We use the term blocks instead of ways because the greedy algorithm was used in [76] to decide the
number of cache blocks that each application receives in a fully associative cache. However, the explanation
can also be thought of as assigning ways in a set-associative cache.
the greedy algorithm for each combination of the non-convex points of all applications.
However, the number of times the greedy algorithm is invoked increases with the number
of combinations of non-convex points of all the applications. Figure 6.14 shows that an
application (galgel) can have as many as 15 non-convex points, indicating that the number
of combinations of non-convex points of all the competing applications can be very large.
To limit the time complexity, [78] suggests that the greedy algorithm be invoked only for
some number of randomly chosen combinations of non-convex points. However, for a given
number of trials, the likelihood that randomization will yield the optimum partition decreases
as the number of combinations increases.
Algorithm 2 Greedy Algorithm
balance = N                          /* Num blocks to be allocated */
allocations[i] = 0 for each competing application i
while (balance) do:
    foreach application i, do:       /* get utility for next 1 block */
        alloc = allocations[i]
        Unext[i] = get_util_value(i, alloc, alloc+1)
    winner = application with maximum value of Unext
    allocations[winner]++
    balance = balance - 1
return allocations

get_util_value(p, a, b):
    U = change in misses for application p when the number
        of blocks assigned to it increases from a to b
    return U
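The pathological behavior of the greedy algorithm on non-convex utility curves can be demonstrated with a small runnable sketch. The miss curves are made up for illustration: one application has a convex curve, and the other (loosely modeled on the behavior described for art) gains nothing until it holds 4 blocks. The function and variable names are assumptions.

```python
def greedy_partition(miss_curves, total_blocks):
    """miss_curves[i][n] = misses of application i when it holds n blocks.

    Each curve must have total_blocks + 1 entries.
    """
    allocations = [0] * len(miss_curves)
    for _ in range(total_blocks):
        # Utility (reduction in misses) of the next single block, per application.
        gains = [curve[a] - curve[a + 1]
                 for curve, a in zip(miss_curves, allocations)]
        winner = gains.index(max(gains))   # ties go to the lowest index
        allocations[winner] += 1
    return allocations

# Hypothetical miss curves for 0..8 blocks.
convex = [100, 90, 81, 73, 66, 60, 55, 51, 48]
step = [100, 100, 100, 100, 20, 20, 20, 20, 20]   # no gain until 4 blocks

greedy_alloc = greedy_partition([convex, step], 8)
greedy_misses = convex[greedy_alloc[0]] + step[greedy_alloc[1]]
```

In this example the step-shaped application never wins a block, because its gain from one additional block is always zero, even though giving it 4 blocks would remove 80 misses.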

























(Remaining ways are turned off)
Figure 6.14: Benchmarks with non-convex utility curves
6.6.2 The Lookahead Algorithm
We define marginal utility (MU) as the utility U per unit cache resource. If $miss_a$
and $miss_b$ are the number of misses that an application incurs when it receives a and b
blocks respectively, then the marginal utility, $MU_a^b$, of increasing the blocks from a to b is
defined as:
$MU_a^b = (miss_a - miss_b)/(b - a) = U_a^b/(b - a)$    (6.10)
The basic problem with the greedy algorithm is that it considers the marginal utility
of only the immediate block, and thus fails to see potentially high gains after the first
block if there is no gain from the first block. If the algorithm could also take into account
the gains from far ahead, then it could make better partitioning decisions. We propose
the Lookahead Algorithm, which considers the marginal utility for every possible number of
blocks that the application can receive. The pseudocode for the Lookahead algorithm is
shown in Algorithm 3.
Algorithm 3 Lookahead Algorithm
balance = N                          /* Num blocks to be allocated */
allocations[i] = 0 for each competing application i
while (balance) do:
    foreach application i, do:       /* get max marginal utility */
        alloc = allocations[i]
        max_mu[i] = get_max_mu(i, alloc, balance)
        blocks_req[i] = min blocks to get max_mu[i] for i
    winner = application with maximum value of max_mu
    allocations[winner] += blocks_req[winner]
    balance -= blocks_req[winner]
return allocations

get_max_mu(p, alloc, balance):
    max_mu = 0
    for (ii=1; ii<=balance; ii++) do:
        mu = get_mu_value(p, alloc, alloc+ii)
        if (mu > max_mu) max_mu = mu
    return max_mu

get_mu_value(p, a, b):
    U = change in misses for application p when the number
        of blocks assigned to it increases from a to b
    return U/(b-a)
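A runnable sketch of the Lookahead algorithm follows. It mirrors the pseudocode but is only an illustration: the miss curves are made up (one convex curve and one step-shaped curve that gains nothing until 4 blocks), ties are broken in favor of the lowest-numbered application, and an application whose maximum marginal utility is zero is treated as requesting one block so that the loop always makes progress. These choices and all names are assumptions.

```python
def lookahead_partition(miss_curves, total_blocks):
    """miss_curves[i][n] = misses of application i when it holds n blocks.

    Each curve must have total_blocks + 1 entries.
    """
    allocations = [0] * len(miss_curves)
    balance = total_blocks
    while balance > 0:
        best = None  # (max_mu, application, blocks_req)
        for app, curve in enumerate(miss_curves):
            alloc = allocations[app]
            max_mu, blocks_req = 0.0, 1
            for extra in range(1, balance + 1):
                # Marginal utility of growing from alloc to alloc + extra blocks.
                mu = (curve[alloc] - curve[alloc + extra]) / extra
                if mu > max_mu:   # strict '>' keeps the minimum block count
                    max_mu, blocks_req = mu, extra
            if best is None or max_mu > best[0]:
                best = (max_mu, app, blocks_req)
        _, winner, blocks_req = best
        allocations[winner] += blocks_req
        balance -= blocks_req
    return allocations

# Hypothetical miss curves for 0..8 blocks.
convex = [100, 90, 81, 73, 66, 60, 55, 51, 48]
step = [100, 100, 100, 100, 20, 20, 20, 20, 20]   # no gain until 4 blocks

lookahead_alloc = lookahead_partition([convex, step], 8)
lookahead_misses = convex[lookahead_alloc[0]] + step[lookahead_alloc[1]]
```

In the first iteration the step-shaped application has a marginal utility of 80/4 = 20 at 4 blocks, which exceeds the best marginal utility of the convex application (10), so it receives 4 blocks at once; the remaining blocks then go to the convex application one at a time.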
In each iteration, the maximum marginal utility (max_mu) and the minimum number
of blocks at which max_mu occurs are calculated for each application. The application
with the highest value of max_mu is assigned the number of blocks it needs to obtain max_mu.
Ties for the highest value of max_mu are broken arbitrarily. The iterations are repeated until all
blocks are assigned. The lookahead algorithm can assign a different number of blocks in
each iteration and is guaranteed to terminate, as at least one block is assigned in each iteration.
For applications with a convex utility function, the maximum value of marginal utility
occurs for the first block. Therefore, if all the applications have convex utility functions,
then the lookahead algorithm behaves identically to the greedy algorithm, which is proved to
be optimal for convex functions.
The step for obtaining the value of max_mu for each application is executed
in parallel by the UMON circuits. Calculating max_mu for an application that could
get up to N blocks takes N operations of add-divide-compare each. As blocks are
allocated, the number of blocks that an application can receive in an iteration decreases. In
the worst case, only one block is allocated in every iteration. Then, even in the worst case,
the time required for the lookahead algorithm to allocate N blocks is: N + (N − 1) +
(N − 2) + ... + 1 = N(N + 1)/2 ≈ N^2/2 operations. In our studies, cache is assigned
at a way granularity instead of a block granularity. Therefore, the value of N is equal to
the associativity of the cache. Thus, for partitioning a 32-way cache, the lookahead algorithm
will require a maximum of approximately 512 operations (recall that we perform partitioning once
every 5M cycles). In our experiments, we ensure that both the greedy algorithm and the
lookahead algorithm allocate at least one way to each of the competing applications.
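The worst-case count can be checked numerically. The helper below is a hypothetical illustration that simply adds up the number of candidate block counts examined per application when exactly one block is allocated in every iteration.

```python
def worst_case_operations(n_blocks):
    """Sum N + (N - 1) + ... + 1: candidates examined per application when
    one block is allocated per iteration."""
    return sum(range(1, n_blocks + 1))

exact = worst_case_operations(32)
approx = 32 ** 2 // 2
```

For N = 32 the exact sum is 528, which is within about 3% of the N^2/2 = 512 figure used as the approximation in the text.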
6.6.3 Result for Partitioning Algorithms
We evaluate the partitioning algorithms on a quad-core system in which four applications
share a 2MB 32-way cache. As there are four cores, the ideal value for weighted
speedup is 4. Figure 6.15 shows the weighted speedup for the LRU policy and the UCP
policy with the three partitioning algorithms: greedy, lookahead, and EvalAll. The EvalAll
algorithm evaluates all the possible partitions to find the best partition. The greedy algorithm
works well when all the benchmarks in the workload have convex utility curves
(mix1) or when the cache is big enough to support the working set of the majority of the benchmarks
in the workload (mix2). However, for workloads that contain benchmarks with non-convex
utility curves (mix3 and mix4), the greedy algorithm does not perform as well as
the EvalAll algorithm. The lookahead algorithm performs similar to the EvalAll algorithm








































Figure 6.15: Comparison of Partitioning Algorithms
6.7 Related Work
6.7.1 Related Work in Cache Partitioning
Stone et al. [76] investigated optimal (static) partitioning of cache memory between
two or more applications when the information about the change in misses for varying cache
size is available for each of the competing applications. However, such information is hard to
obtain statically for all applications as it may depend on the input set of the application. The
objective of our study is to dynamically partition the cache by computing this information
at runtime. Moreover, as shown in Section 6.5.4, dynamic partitioning can adapt to the
time-varying phase behavior of the competing applications, which makes it possible for
dynamic partitioning to outperform even the best static partitioning.
Dynamic partitioning of shared cache was first investigated by Suh et al. [79][78].
[78] describes a low-overhead scheme that uses the recency position of the hits for the lines in
the cache to estimate the utility of the cache for each application. However, obtaining the
utility information from the main cache has the following shortcomings: (1) The number of
lines in each set for which the utility information can be obtained for a given application
also depends on the other application. (2) The recency position at which the application
gets a hit is also affected by the other application, which means that the utility information
computed for an application is dependent on (and polluted by) the concurrently executing
application. UCP avoids these problems by separating the monitoring circuit from the
main cache so that the utility information of the application is independent of other concurrently
executing applications. Figure 6.16 compares UCP to a scheme that uses in-cache
information for estimating utility information. The in-cache scheme provides a 4% average
improvement compared to the 11% average improvement of UCP. Thus, separating
the monitoring circuit from the main cache is important for obtaining high performance from
dynamic partitioning. However, doing this by having extra tags for each cache line incurs
prohibitive hardware overhead. Our proposal makes it practical to compute the utility infor-



























































































































Figure 6.16: UCP vs. an In-cache monitoring scheme.
Mechanisms for enabling Quality of Service (QoS) in multicore and multithreaded
architectures are discussed in [28]. That work emphasizes that factors such as priority, locality, and
latency sensitivity should be considered in dividing the cache among competing applications.
It also describes different mechanisms to facilitate static and dynamic partitioning
of cache between competing applications. However, the design of intelligent partitioning
policies to use these mechanisms is left as an open research topic.
Recently, Hsu et al. [27] studied different policies, including a utilitarian policy,
for partitioning a shared cache between competing applications. However, they analyzed
these policies using the best offline parameters, and mechanisms for obtaining these parameters
at runtime are left for future work.
6.7.2 Related Work in Cache Organization
Liu et al. [44] investigated cache organizations for CMPs. They proposed Shared
Processor-Based Split L2 Caches, in which the number of private banks allocated to each
competing application is decided statically using profile information. However, it may be
impractical to profile all the applications that execute concurrently. Our mechanism avoids
profiling by computing the utility information at runtime using cost-effective UMONs.
Recent proposals [14][10] have looked at dynamic mechanisms to obtain the hit
latency of a private cache while approximating the capacity benefits of a shared cache. Our
work differs from these in that it focuses on increasing the capacity benefits of a shared
cache. It can be combined with these proposals to obtain both improved capacity and
improved latency from a cache organization.
6.7.3 Related Work in Memory Allocation
In the operating systems domain, Zhou et al. [90] looked at page allocation for
competing applications using miss ratio curves. The objective of their study and ours
is the same; however, their study deals with the allocation of physical memory, which
is fully associative, whereas our study deals with the allocation of on-chip caches, which
are set-associative. The hardware solution proposed in [90] stores an extra tag entry for
each page in a separate hardware structure for each competing application. While this may
be cost-effective in the paging domain (approximately 4B per 4kB page), keeping multiple tags
for each cache line for on-chip caches is hardware-intensive and power-hungry. For example,
if four applications share a cache and each tag entry is 4B, then the storage required per
cache line is 16B (which is a 25% overhead for a 64B cache line), rendering the scheme impractical
for on-chip caches. Fortunately, on-chip caches are set-associative, which makes
them amenable to dynamic set sampling (DSS). Our mechanism uses DSS to provide a
cost-effective partitioning framework that requires less than 1% storage overhead.
6.7.4 Related work in SMT
Several proposals [80][9][15] have looked at policies for dynamic partitioning of
processor resources such as reorder buffer entries, execution bandwidth, and the physical register
file between the applications that concurrently execute on an SMT processor. However,
none of these proposals discusses the problem of partitioning the last-level cache among the
competing applications. Although we evaluated UCP for CMP processors, the ideas presented
in this chapter are also applicable to SMT processors.
6.8 Summary
Traditional designs for a shared cache use LRU replacement which partitions the
shared cache among competing applications on a demand basis. The application that ac-
cesses more unique lines in a given interval gets more cache than an application that ac-
cesses fewer unique lines in that interval. However, the benefit (reduction in misses) that
applications get for a given amount of cache resources may not correlate with the demand.
This chapter proposes Utility-Based Cache Partitioning (UCP) to divide the cache among
competing applications based on the benefit (utility) of cache resources for each application
and makes the following contributions:
1. It proposes a low-hardware-overhead utility monitoring circuit to estimate the utility
of the cache resources for each application. Evaluation with 20 multiprogrammed
workloads shows that UCP outperforms LRU on a dual-core system by up to 23% and
by 11% on average, while requiring less than 1% storage overhead.
2. It proposes the Lookahead Algorithm as a scalable alternative to evaluating every
possible partition for partitioning decisions when a large number of applications
share a highly associative cache.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
To bridge the gap between processor speed and memory speed, modern processors
devote the majority of the on-chip transistors to the last-level cache. However, traditional cache
designs (developed for small first-level caches) are inefficient for large caches. Therefore,
cache misses are common, which results in frequent memory accesses and reduced
processor performance. Cache management has become even more critical
because of increasing memory latency, the increasing working sets of many emerging
applications, and the decreasing size of cache devoted to each core due to the increased number
of cores on a single chip. This dissertation focuses on analyzing some of the problems
with managing large caches and proposing cost-effective solutions to improve their performance.
Chapter 3 proposes a cost-effective hybrid replacement mechanism that can select
from multiple replacement policies depending on which policy incurs the fewest cache
misses. This technique exploits the fact that different workloads and program phases have
locality characteristics that make them better suited to different replacement policies. Thus,
a mechanism that selects the best-performing policy at runtime can substantially improve cache
performance. To implement hybrid replacement with low overhead, it shows that cache
behavior can be approximated by sampling a few sets and proposes the concept of Dynamic
Set Sampling (DSS).
Chapter 4 focuses on improving the cache insertion policy. The commonly used
LRU replacement policy results in thrashing for memory-intensive workloads that have
a working set bigger than the cache size. This dissertation shows that the performance of
memory-intensive workloads can be improved significantly by changing the recency position
where the incoming line is inserted. The proposed mechanism reduces cache misses
by 21% over LRU, is robust across a wide variety of workloads, incurs an overhead of less
than two bytes, and does not change the existing cache structure.
Chapter 5 targets the variation in the performance impact of misses to propose parallelism-aware
cache replacement. Modern systems try to service multiple cache misses in parallel.
The variation in Memory Level Parallelism (MLP) causes some misses to be more
costly to performance than other misses. This dissertation presents the first study on MLP-aware
cache replacement and proposes to improve performance by eliminating some of the
performance-critical isolated misses. It also presents a framework that can compute the
MLP-based cost at runtime and uses this cost to drive a cost-sensitive replacement policy.
Chapter 6 analyzes cache partitioning policies for shared caches in chip multiprocessors.
Traditional partitioning policies either divide the cache equally among all
applications or use the LRU policy to do demand-based cache partitioning. This dissertation
shows that performance can be improved if the shared cache is partitioned based
on how much the application benefits from the cache, rather than on its demand for the
cache. It proposes a novel low-overhead circuit that can dynamically monitor the utility of
cache for any application. The proposed partitioning improves weighted speedup by 11%,
throughput by 17%, and fairness by 11% on average compared to LRU. This dissertation
also proposes a low time-complexity algorithm that is scalable to many cores and performs
similarly to searching through the exponential number of possible partitions.
7.2 Future Work
7.2.1 Applications of Dynamic Set Sampling
In the era of CMPs, L2 caches become a scarce resource. Therefore, it is important
to intelligently manage the cache space among competing applications. However, to be
implementable and useful, such management must be cost-effective. Because DSS can
approximate the cache behavior by sampling only a few sets in the cache, it provides a cost-effective
framework for cache management decisions. The idea of dynamic set sampling
can also be used for other cache-related optimizations, such as dynamically tuning the
parameters of a given replacement policy, reducing the hardware overhead of an expensive
replacement policy, or reducing the pollution caused by prefetching mechanisms.
7.2.2 Region-Aware Cache Management
This dissertation took the approach of making cache decisions globally, for example,
choosing a single replacement policy for the entire cache, or allocating a uniform number
of ways to an application across all the sets in a shared cache. Thus, it does not
take into account the variation in locality that exists between different regions of memory
and across different instructions. For example, the locality characteristics and MLP
of data fetched by loads that access arrays are very different from those of data fetched by pointer-chasing
loads. Thus, it may be possible to improve cache management by applying
per-instruction or per-region control rather than global control. However, the primary
challenge in such a scheme would be to gather information and enforce decisions in a cost-effective
manner as the number of regions increases.
7.2.3 Prefetching-Aware Cache Management
This dissertation studied cache management for the demand stream. An important
area of future work in cache management is how to partition the cache between demand
and prefetch streams. Heuristics such as prefetcher accuracy, reuse of demand lines, and
bandwidth contention can be used to limit prefetcher-based pollution while maximizing
the latency-hiding advantage of the prefetching mechanism. Also, the performance impact
of cache lines that are easily prefetchable is very different from that of cache lines that are hard
to prefetch. Thus, cache performance can be greatly improved if hard-to-prefetch lines are
given more cache space than easier-to-prefetch lines. The framework presented in Chapter
5 can be extended to implement such a prefetching-aware cache management scheme.
7.2.4 MLP-Aware Microarchitecture and Memory System
Although we proposed MLP-awareness for cache replacement, the concept of MLP-awareness
can be extended to design an MLP-aware memory system. For example, MLP-aware
prefetchers can focus on prefetching costly misses. The idea of MLP-awareness is
also applicable to processor design. For example, an MLP-aware fetch policy can allow
SMT processors to exploit the maximum MLP from a cache-miss-causing thread.
7.2.5 Extensions of Cache Partitioning
We considered the problem of cache partitioning among the demand streams of
competing applications. The Utility Monitoring circuit (UMON) can be extended to com-
pute utility information for prefetched data, which can help in partitioning the cache among
multiple demand and prefetch streams. The UMON circuits can also be modified to compute
the CPI information for a given application, which can help in providing performance
or fairness guarantees for an application [28][38]. We investigated UCP only for multipro-
grammed workloads. For multithreaded workloads, UCP can take into account both the
variation in the utility of private and shared data, as well as the variation in the utility of
private data of competing threads. Also, the insertion policies proposed in Chapter 4 can be





Proposed Techniques on Remaining SPEC Benchmarks
This appendix analyzes the effect of the optimizations described in this dissertation
on the eleven SPEC CPU2000 benchmarks that were left out from the detailed studies.
Table 1.1 shows the fraction of misses that are compulsory misses for these benchmarks.
For all benchmarks, the majority of the misses are compulsory misses. As improving cache
replacement cannot reduce the number of compulsory misses, there is little scope for reducing
misses with the caching optimizations proposed in Chapters 3, 4, and 5.
Table 1.1: Compulsory misses for the remaining SPEC benchmarks
Benchmark          gcc    vortex  applu  wupwise  mesa   crafty  gzip   fma3d  gap    perlbmk  eon
Compulsory misses  51.2%  53.7%   62.9%  83.0%    83.6%  88.8%   97.2%  99.3%  99.9%  100%     100%
1.1 Hybrid Replacement via Dynamic Set Sampling
Table 1.2 shows the MPKI with four replacement policies: LRU, LFU, the TSEL
selection between LRU and LFU, and the SBAR-based selection between LRU and LFU.
For all benchmarks, the performance of all four policies is similar.
Table 1.2: MPKI with Hybrid Replacement on Remaining SPEC Benchmarks
Policy  gcc   vortex  applu  wupwise  mesa  crafty  gzip  fma3d  gap   perlbmk  eon
LRU     0.35  0.71    13.75  2.25     0.62  0.09    1.45  4.61   1.65  0.04     0.01
LFU     0.40  0.74    13.99  2.53     0.67  0.09    1.46  4.60   1.67  0.04     0.01
TSEL    0.35  0.71    13.71  2.25     0.62  0.09    1.46  4.61   1.65  0.04     0.01
SBAR    0.35  0.71    13.71  2.25     0.62  0.09    1.46  4.61   1.65  0.04     0.01
1.2 Adaptive Insertion Policies
Table 1.3 shows the MPKI with the baseline LRU policy and the DIP policy proposed
in Chapter 4. For all benchmarks, the MPKI with LRU and with DIP is similar.
Table 1.3: MPKI with LRU and DIP on Remaining SPEC Benchmarks
Policy  gcc   vortex  applu  wupwise  mesa  crafty  gzip  fma3d  gap   perlbmk  eon
LRU     0.35  0.71    13.75  2.25     0.62  0.09    1.45  4.61   1.65  0.04     0.01
DIP     0.35  0.72    13.76  2.26     0.63  0.09    1.46  4.61   1.66  0.04     0.01
1.3 MLP-Aware Cache Replacement
Table 1.4 shows the IPC with the baseline LRU policy, the cost-sensitive policy
LIN, and the SBAR-based cost-sensitive selection between LRU and LIN. For all benchmarks
except wupwise, the performance of the three policies is similar. For wupwise,
LIN reduces performance by 6% compared to LRU, while SBAR performs similarly to the
baseline.
Table 1.4: IPC with LRU, LIN, and SBAR on Remaining SPEC Benchmarks
Policy  gcc   vortex  applu  wupwise  mesa  crafty  gzip  fma3d  gap   perlbmk  eon
LRU     0.56  1.78    0.79   1.40     2.91  1.26    1.27  1.99   2.44  2.30     1.91
LIN     0.56  1.76    0.81   1.30     2.90  1.26    1.26  2.00   2.42  2.30     1.91
SBAR    0.56  1.77    0.79   1.40     2.91  1.26    1.27  2.00   2.44  2.30     1.91
Bibliography
[1] Anant Agarwal, John Hennessy, and Mark Horowitz. Cache performance of operating
systems and multiprogramming. In ACM Transactions on Computer Systems, 6,
pages 393–431, November 1988.
[2] Haitham Akkary, Ravi Rajwar, and Srikanth T. Srinivasan. Checkpoint processing
and recovery: Towards scalable large instruction window processors. In Proceedings
of the 36th Annual ACM/IEEE International Symposium on Microarchitecture, pages
423–434, 2003.
[3] Alaa R. Alameldeen and David A. Wood. Adaptive cache compression for high-performance
processors. In ISCA-31, page 212, 2004.
[4] Alaa R. Alameldeen and David A. Wood. Frequent pattern compression: A significance-
based compression scheme for L2 caches. Technical Report 1500, Computer Sci-
ences Department, University of Wisconsin - Madison, 2004.
[5] Jean-Loup Baer and Tien-Fu Chen. Effective hardware-based data prefetching for
high-performance processors. IEEE Trans. Comput., 44(5):609–623, 1995.
[6] S. Bansal and D. Modha. CAR: Clock with adaptive replacement. In Proceedings
of the USENIX Conference on File and Storage Technologies (FAST), pages 187–200,
March 2004.
[7] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. In
IBM Systems Journal, pages 78–101, 1966.
[8] Brad Calder, Dirk Grunwald, and Joel Emer. Predictive sequential associative cache.
In Proceedings of the Second IEEE International Symposium on High Performance
Computer Architecture, pages 244–253, 1996.
[9] Francisco J. Cazorla, Alex Ramirez, Mateo Valero, and Enrique Fernandez. Dynam-
ically controlled resource allocation in SMT processors. In Proceedings of the 37th
Annual ACM/IEEE International Symposium on Microarchitecture, pages 171–182,
2004.
[10] Jichuan Chang and Gurindar S. Sohi. Cooperative caching for chip multiprocessors.
In Proceedings of the 33rd Annual International Symposium on Computer Architecture,
pages 264–276, 2006.
[11] Chi F. Chen, Se-Hyun Yang, Babak Falsafi, and Andreas Moshovos. Accurate and
complexity-effective spatial pattern prediction. In HPCA-10, page 276, 2004.
[12] William Y. Chen, Roger A. Bringmann, Scott A. Mahlke, Richard E. Hank, and
James E. Sicolo. An efficient architecture for loop based data prefetching. In
Proceedings of the 25th Annual ACM/IEEE International Symposium on Microar-
chitecture, pages 92–101, 1992.
[13] D.T. Chiou. Extending the reach of microprocessors: column and curiouscaching.
PhD thesis, Massachusetts Institute of Technology, 1999.
[14] Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. Optimizing replication,
communication, and capacity allocation in CMPs. InProceedings of the 32nd Annual
International Symposium on Computer Architecture, pages 357–368, 2005.
[15] Seungryul Choi and Donald Yeung. Learning-based SMT processor resource distribution
via hill-climbing. In Proceedings of the 33rd Annual International Symposium
on Computer Architecture, 2006.
[16] Yuan Chou, Brian Fahs, and Santosh Abraham. Microarchitecture optimizations for
exploiting memory-level parallelism. In Proceedings of the 31st Annual International
Symposium on Computer Architecture, 2004.
[17] Robert Cooksey. Content-Sensitive Data Prefetching. PhD thesis, University of
Colorado, Boulder, 2002.
[18] Adrian Cristal et al. Kilo-instruction processors: Overcoming the memory wall.
IEEE Micro, 25(3):48–57, May 2005.
[19] Peter J. Denning. The working set model for program behavior. Communications of
the ACM, 11(5):323–333, 1968.
[20] James Dundas and Trevor Mudge. Improving data cache performance by pre-executing
instructions under a cache miss. In Proceedings of the 1997 International Conference
on Supercomputing, pages 68–75, 1997.
[21] Wi-fen Lin et al. Reducing DRAM latencies with an integrated memory hierarchy
design. In HPCA ’01: Proceedings of the 7th International Symposium on High-Performance
Computer Architecture, pages 301–312, 2001.
[22] W. C. Fu, J. H. Patel, and B. L. Janssens. Stride directed prefetching in scalar processors.
In Proceedings of the 25th Annual ACM/IEEE International Symposium on
Microarchitecture, pages 102–110, 1992.
[23] A. Glew. MLP yes! ILP no! In Wild and Crazy Ideas Session, 8th International
Conference on Architectural Support for Programming Languages and Operating
Systems, October 1998.
[24] Antonio Gonzalez, Carlos Aliagas, and Mateo Valero. A data cache with multiple
caching strategies tuned to different types of locality. In ICS ’95: Proceedings of the
9th international conference on Supercomputing, pages 338–347, 1995.
[25] Erik G. Hallnor and Steven K. Reinhardt. A fully associative software-managed
cache design. In Proceedings of the 27th Annual International Symposium on Computer
Architecture, pages 107–116, 2000.
[26] Mark Donald Hill. Aspects of cache memory and instruction buffer performance.
PhD thesis, 1987.
[27] Lisa R. Hsu et al. Communist, utilitarian, and capitalist cache policies on CMPs:
caches as a shared resource. In PACT-15, 2006.
[28] Ravi Iyer. CQoS: a framework for enabling QoS in shared caches of CMP platforms.
In Proceedings of the 18th International Conference on Supercomputing, pages 257–
266, 2004.
[29] Jaeheon Jeong and Michel Dubois. Optimal replacements in caches with two miss
costs. In SPAA ’99: Proceedings of the 11th Annual ACM Symposium on Parallel
Algorithms and Architectures, pages 155–164, 1999.
[30] Jaeheon Jeong and Michel Dubois. Cost-sensitive cache replacement algorithms.
In Proceedings of the Ninth IEEE International Symposium on High Performance
Computer Architecture, 2003.
[31] Teresa L. Johnson. Run-time adaptive cache management. PhD thesis, University of
Illinois, Urbana, IL, May 1998.
[32] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In Proceedings
of the 24th Annual International Symposium on Computer Architecture, pages
252–263, 1997.
[33] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a
small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual
International Symposium on Computer Architecture, pages 364–373, 1990.
[34] Tejas Karkhanis and J. E. Smith. A day in the life of a data cache miss. In Second
Annual Workshop on Memory Performance Issues, 2002.
[35] Tejas S. Karkhanis and James E. Smith. A first-order superscalar processor model. In
Proceedings of the 31st Annual International Symposium on Computer Architecture,
2004.
[36] Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi. Cache decay: exploiting
generational behavior to reduce cache leakage power. In ISCA ’01: Proceedings of
the 28th annual international symposium on Computer architecture, pages 240–251,
2001.
[37] Mazen Kharbutli, Keith Irwin, Yan Solihin, and Jaejin Lee. Using prime numbers
for cache indexing to eliminate conflict misses. In Proceedings of the Tenth IEEE
International Symposium on High Performance Computer Architecture, pages 288–299,
2004.
[38] Seongbeom Kim, Dhruba Chandra, and Yan Solihin. Fair cache sharing and partitioning
in a chip multiprocessor architecture. In Proceedings of the 13th International
Conference on Parallel Architectures and Compilation Techniques, pages 111–122,
2004.
[39] David Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings
of the 8th Annual International Symposium on Computer Architecture, pages
81–87, 1981.
[40] Sanjeev Kumar and Chris Wilkerson. Exploiting spatial locality in data caches using
spatial footprints. In ISCA-25, pages 357–368, 1998.
[41] An-Chow Lai, Cem Fide, and Babak Falsafi. Dead-block prediction & dead-block
correlating prefetchers. In ISCA ’01: Proceedings of the 28th annual international
symposium on Computer architecture, pages 144–154, 2001.
[42] Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H. Noh, Sang Lyul Min, Yookun
Cho, and Chong-Sang Kim. On the existence of a spectrum of policies that sub-
sumes the least recently used (LRU) and least frequently used (LFU) policies. In
Measurement and Modeling of Computer Systems, pages 134–143, 1999.
[43] W. Lin and S. Reinhardt. Predicting last-touch references under optimal replacement.
In Technical Report CSE-TR-447-02, University of Michigan, 2002.
[44] Chun Liu, Anand Sivasubramaniam, and Mahmut Kandemir. Organizing the last line
of defense before hitting the memory wall for CMPs. In Proceedings of the Tenth
IEEE International Symposium on High Performance Computer Architecture, page
176, 2004.
[45] Kun Luo et al. Balancing throughput and fairness in SMT processors. In IEEE
International Symposium on Performance Analysis of Systems and Software, 2001.
[46] R. L. Mattson et al. Evaluation techniques in storage hierarchies. IBM Journal of
Research and Development, 9:78–117, 1970.
[47] Scott McFarling. Cache replacement with dynamic exclusion. Technical Report
TN-22, Digital Western Research Laboratory, November 1991.
[48] Scott McFarling. Cache replacement with dynamic exclusion. In Proceedings of
the 19th Annual International Symposium on Computer Architecture, pages 191–200,
1992.
[49] Scott McFarling. Combining branch predictors. Technical Report TN-36, Digital
Western Research Laboratory, June 1993.
[50] Nimrod Megiddo and Dharmendra Modha. ARC: A low overhead self tuning replacement
cache. In USENIX File and Storage Technologies, 2003.
[51] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. Runahead execution:
An alternative to very large instruction windows for out-of-order processors. In Proceedings
of the Ninth IEEE International Symposium on High Performance Computer
Architecture, pages 129–140, 2003.
[52] N. Young. The K-server dual and loose competitiveness for paging. Algorithmica,
11(2), 1994.
[53] Kyle J. Nesbit and James E. Smith. Data cache prefetching using a global history
buffer. In HPCA ’04: Proceedings of the 10th International Symposium on High
Performance Computer Architecture, page 96, 2004.
[54] Victor F. Nicola, Asit Dan, and Daniel M. Dias. Analysis of the generalized clock
buffer replacement scheme for database transaction processing. In SIGMETRICS
’92/PERFORMANCE ’92: Proceedings of the 1992 ACM SIGMETRICS joint inter-
national conference on Measurement and modeling of computer systems, pages 35–
46, 1992.
[55] Vijay S. Pai and Sarita Adve. Code transformations to improve memory parallelism.
In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microar-
chitecture, pages 147–155, 1999.
[56] S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement.
In Proceedings of the 21st Annual International Symposium on Computer
Architecture, pages 24–33, 1994.
[57] Jih-Kwon Peir, Yongjoon Lee, and Windsor W. Hsu. Capturing dynamic memory
reference behavior with adaptive cache topology. In Proceedings of the 8th International
Conference on Architectural Support for Programming Languages and Operating
Systems, pages 134–143, 1998.
[58] Erez Perelman et al. Using SimPoint for accurate and efficient simulation. ACM
SIGMETRICS Performance Evaluation Review, 31(1):318–319, 2003.
[59] Allan Kennedy Porterfield. Software methods for improvement of cache performance
on supercomputer applications. PhD thesis, Rice University, 1989.
[60] Prateek Pujara and Aneesh Aggarwal. Increasing the cache efficiency by eliminating
noise. In HPCA-12, 2006.
[61] T. R. Puzak. Analysis of cache replacement algorithms. PhD thesis, Univ. of Mass.,
ECE Dept., Amherst, MA., 1985.
[62] Moinuddin Qureshi, M. Aater Suleman, and Yale N. Patt. Line distillation: Increas-
ing cache capacity by filtering unused words in cache lines. In Proceedings of the
13th International Symposium on High-Performance Computer Architecture, 2007.
[63] Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt. A case for
MLP-aware cache replacement. In Proceedings of the 33rd Annual International
Symposium on Computer Architecture, 2006.
[64] Moinuddin K. Qureshi, David Thompson, and Yale N. Patt. The V-Way Cache:
Demand Based Associativity via Global Replacement. In Proceedings of the 32nd
Annual International Symposium on Computer Architecture, pages 544–555, 2005.
[65] R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek. A resource allocation model for
QoS management. In Proceedings of the 18th IEEE Real-Time Systems Symposium
(RTSS ’97), page 298, 1997.
[66] J.A. Rivers and E.G. Davidson. Reducing conflicts in direct-mapped caches with
temporality-based design. International Conference on Parallel Processing, pages
93–103, 1996.
[67] John T. Robinson and Murthy V. Devarakonda. Data cache management using
frequency-based replacement. In SIGMETRICS ’90: Proceedings of the 1990 ACM
SIGMETRICS conference on Measurement and modeling of computer systems, pages
134–142, 1990.
[68] Sheldon Ross. A First Course in Probability. Pearson Prentice Hall, 7th edition, 2006.
[69] Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. The EELRU adaptive replacement
algorithm. Performance Evaluation, 53(2):93–123, 2003.
[70] A. J. Smith. Sequentiality and prefetching in database systems. ACM Transactions on
Database Systems, 3(3):223–247, September 1978.
[71] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473–530, 1982.
[72] James E. Smith and James R. Goodman. A study of instruction cache organizations
and replacement policies. In ISCA ’83: Proceedings of the 10th annual international
symposium on Computer architecture, pages 132–137, Los Alamitos, CA,
USA, 1983. IEEE Computer Society Press.
[73] Kimming So and Rudolph N. Rechtschaffen. Cache operations by MRU change. IEEE
Trans. on Computers, C-37(6), June 1988.
[74] Srikanth T. Srinivasan, Roy Dz-Ching Ju, Alvin R. Lebeck, and Chris Wilkerson.
Locality vs. criticality. In Proceedings of the 28th Annual International Symposium
on Computer Architecture, 2001.
[75] Srikanth T. Srinivasan and Alvin R. Lebeck. Load latency tolerance in dynamically
scheduled processors. In Proceedings of the 31st Annual ACM/IEEE International
Symposium on Microarchitecture, 1998.
[76] Harold S. Stone, John Turek, and Joel L. Wolf. Optimal partitioning of cache memory.
IEEE Transactions on Computers, 41(9):1054–1068, 1992.
[77] R. Subramanian, Y. Smaragdakis, and Gabriel Loh. Adaptive caches: Effective shaping
of cache behavior to workloads. In Proceedings of the 39th Annual ACM/IEEE
International Symposium on Microarchitecture, 2006.
[78] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache mem-
ory. Journal of Supercomputing, 28(1):7–26, 2004.
[79] G. Edward Suh, Srinivas Devadas, and Larry Rudolph. A new memory monitoring
scheme for memory-aware scheduling and partitioning. In Proceedings of the Tenth
IEEE International Symposium on High Performance Computer Architecture, page
117, 2002.
[80] Dean M. Tullsen and Jeffery A. Brown. Handling long-latency loads in a simultaneous
multithreading processor. In Proceedings of the 34th Annual ACM/IEEE
International Symposium on Microarchitecture, pages 318–327, 2001.
[81] Gary Tyson, Matthew Farrens, John Matthews, and Andrew R. Pleszkun. A modified
approach to data cache management. In MICRO 28: Proceedings of the 28th annual
international symposium on Microarchitecture, pages 93–103, 1995.
[82] Zhenlin Wang, Kathryn S. McKinley, Arnold L. Rosenberg, and Charles C. Weems.
Using the compiler to improve cache replacement decisions. In PACT ’02: Proceedings
of the 2002 International Conference on Parallel Architectures and Compilation
Techniques, page 199, 2002.
[83] M. V. Wilkes. Slave memories and dynamic storage allocation. IEEE Transactions
on Electronic Computers, 14(2):270–271, 1965.
[84] Maurice V. Wilkes. The memory gap and the future of high performance memories.
ACM Computer Architecture News, 29(1):2–7, March 2001.
[85] Wayne A. Wong and Jean-Loup Baer. Modified LRU policies for improving second-level
cache behavior. In Proceedings of the Sixth IEEE International Symposium on
High Performance Computer Architecture, pages 49–60, 2000.
[86] Wm. Wulf and Sally McKee. Hitting the memory wall: Implications of the obvious.
ACM Computer Architecture News, 23(1):20–24, March 1995.
[87] Jun Yang, Youtao Zhang, and Rajiv Gupta. Frequent value compression in data
caches. In MICRO-33, pages 258–265, 2000.
[88] Huiyang Zhou. Dual-core execution: Building a highly scalable single-thread instruction
window. In Proceedings of the 14th International Conference on Parallel
Architectures and Compilation Techniques, pages 231–242, 2005.
[89] Huiyang Zhou and Thomas M. Conte. Enhancing memory level parallelism via
recovery-free value prediction. In Proceedings of the 17th International Conference
on Supercomputing, pages 326–335, 2003.
[90] Pin Zhou, Vivek Pandey, Jagadeesan Sundaresan, Anand Raghuraman, Yuanyuan
Zhou, and Sanjeev Kumar. Dynamic tracking of page miss ratio curve for memory
management. In Proceedings of the 11th International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 177–188, 2004.
Vita
Moinuddin Qureshi, the son of Khalil Ahmed Qureshi and Hasina Qureshi, was
born in Kalyan, India on October 2, 1978. He received the Bachelor of Engineering degree
in Electronics from the University of Bombay in 2000. The following year he worked as an
IC design engineer at the Texas Instruments research and development center in Bangalore,
India. He entered the Ph.D. program in 2001 at the University of Texas at Austin where he
began working with his Ph.D. adviser Dr. Yale N. Patt. He received the Master of Science
degree in Electrical Engineering in 2003.
While in graduate school, he served as a teaching assistant for five semesters. He
also had several summer internships at IBM and Intel. He has published papers in The
International Symposium on Computer Architecture (ISCA-32, ISCA-33 and ISCA-34),
The International Symposium on Microarchitecture (MICRO-39), The International Sym-
posium on High Performance Computer Architecture (HPCA-13), and The International
Conference on Dependable Systems and Networks (DSN). His graduate studies were sup-
ported in part by a Ph.D. Fellowship from IBM during the academic years 2003-2006.
Permanent address: 2128 VP Street, Apt B1-502
Camp, Pune 411001 India
This dissertation was typeset with LaTeX† by the author.
†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald
Knuth’s TeX Program.
