



MITIGATING CACHE ASSOCIATIVITY AND COHERENCE SCALABILITY 






In Partial Fulfilment of the Requirements for the Degree of 
Doctor of Philosophy in Computer Engineering 
 
 
School of Computing and Communications 
Faculty of Engineering and Information Technology 






 
CERTIFICATE OF ORIGINAL AUTHORSHIP 
 
I certify that the work in this thesis has not previously been submitted for a degree 
nor has it been submitted as part of requirements for a degree except as part of the 
collaborative doctoral degree and/or fully acknowledged within the text. 
I also certify that the thesis has been written by me. Any help that I have received 
in my research work and the preparation of the thesis itself has been acknowledged. In 
addition, I certify that all information sources and literature used are indicated in the 
thesis. 
 
Signature of Student:   
  
Date: 




Acknowledgments

O Allah, being able to thank You is the greatest blessing You have blessed me with;
You are the one who gave me the capability to accomplish this significant task of my life.
I would like to express my deep and sincere gratitude to my supervisor, Dr. Zenon 
Chaczko, for his persistent guidance, encouragement, and support throughout my period 
of study at the University of Technology, Sydney. His advice was critical in helping me 
lay the right foundation for my research, and thereby develop the right perspective in 
moving forward. 
There are no words that could express my true gratitude to my beloved parents, 
wife, relatives and friends for their constant motivation, endless support and 
understanding. This thesis would have been impossible without every one of them. I
married my wife halfway through my Ph.D. candidature, and I appreciate her
programming assistance, emotional support and unfaltering understanding. I extend
special gratitude to my parents, the two people who raised me, offered their full
support, loved me and made me the person I am today.
  
 
Abstract  
Chip Multi-Processor (CMP) designs have become dominant in the processor
market. The evaluation and development of CMPs is essential for product improvement.
To date, CMPs have presented many challenges for system designers, including cache
memory system scalability. My research aims to implement a highly scalable CMP cache
memory system using an associative cache with an enhanced replacement policy and a
scalable cache coherence protocol.
This thesis establishes a novel Adaptive Hashing and Replacement Cache (AHRC)
design, which can maintain high associativity with an advanced replacement
policy. The AHRC design can improve associativity while keeping the number of possible
locations for each block (the ways) to a minimum. For the AHRC, the Adaptive Reuse
Interval Prediction (ARIP) replacement policy was used because of its ability to resist
both scanning and thrashing access patterns.
This research involved simulating several workloads on a large-scale CMP with
AHRC as the last-level cache. The results demonstrated that AHRC has better energy
efficiency and higher performance than conventional caches. Additionally, larger caches
that use AHRC are the most suitable for many-core CMPs because, unlike smaller caches,
they support scalability. Scalable cache coherence protocols are essential for CMP
systems in order to satisfy the requirement for high-performance chips
with shared memory. However, the limited size of the directory cache in
larger systems may result in recurrent evictions of directory entries and invalidations of
cached blocks, thus compromising system performance.
  
 
This thesis proposes a scalable coherence protocol called PROI, named after its
Private/Shared, Read-Only/Read-Write and Invalid/Valid classifications. This novel protocol
makes a slight modification to the cache tags, allowing it to differentiate between private and
shared data at block granularity. PROI also employs a dynamic write policy
with self-invalidation and self-downgrade for each L1 cache. It can sustain system
coherence and performance, scale with an increasing number of cores, and reduce the area,
energy, and performance costs associated with the coherence mechanism. The results
indicate that PROI reduces the miss ratio of the private L1 cache by 17%, lowers
network traffic, shortens application runtime by approximately 6%, and cuts
energy consumption by about 35%. Therefore, utilising AHRC, ARIP, and PROI can
significantly mitigate the cache scalability constraints and maintain the performance level
while reducing the energy consumption of the CMP cache.
 
Table of Contents 
 
 
Acknowledgments ...................................................................... ii 
Abstract ....................................................................................... iv 
List of Figures ............................................................................ xi 
List of Tables ............................................................................ xiii 
List of Abbreviations .............................................................. xiv 
I. Principal Theory and Concepts ........................................... 1 
Chapter 1 Introduction .............................................................. 2 
1.1. Background ....................................................................................... 3 
1.2. Rationale ........................................................................................... 5 
1.2.1. Cache Scalability .......................................................................... 6 
1.2.2. Multicore and Mobile Technology ................................................ 8 
1.3. Aims of the Study ........................................................................... 10 
1.4. Research Hypotheses ..................................................................... 11 
1.5. Contributions to Knowledge .......................................................... 13 
1.6. List of Publications ......................................................................... 14 
1.7. Outline of Thesis............................................................................. 15 
 
  
 
Chapter 2 Cache Scalability .................................................... 21 
2.1. Cache Associativity ........................................................................ 22 
2.1.1. Hashing based Approaches ........................................................ 26 
2.1.2. Increasing the Number of Locations Approach ......................... 27 
2.2. Replacement Policies ...................................................................... 29 
2.2.1. Overview ..................................................................................... 29 
2.2.2. Standard Replacement Policies ................................................. 31 
2.2.3. Recent Replacement Policies for CMPs ...................................... 32 
2.3. Cache Coherency ............................................................................ 39 
2.3.1. Cache Coherency Classification ................................................. 39 
2.3.2. Directory Based Protocols ........................................................... 42 
2.3.3. Coherence Deactivation .............................................................. 49 
Chapter 3 Simulation of Manycore Computer System 
Architecture ........................................................................ 53 
3.1. Research Methodology.................................................................... 54 
3.2. Simulation Tools ............................................................................. 55 
3.2.1. CSA Simulators Technology Overview ...................................... 56 
3.2.2. CSA Simulators Usefulness ....................................................... 57 
3.2.3. Taxonomy of CSA Simulators .................................................... 58 
3.2.4. CSA Simulators Quality Attributes and Evaluation Parameters . 61 
3.2.5. Simics Simulator ........................................................................ 66 
3.3. Case Study: Sniper ......................................................................... 67 
3.3.1. Overview ..................................................................................... 67 
3.3.2. Sniper Configuration .................................................................. 68 
3.3.3. Simulation Results Tools ........................................................... 69 
3.3.4. Multiple Multi-threaded Workloads .......................................... 71 
3.3.5. Scripting ..................................................................................... 71 
3.3.6. Comparison between Sniper and Graphite ................................ 71 
 
3.4. Experimental Setup ....................................................................... 74 
3.4.1. System Setup .................................................................................. 74 
3.4.2. Evaluation Metrics ......................................................................... 75 
3.5. Conclusion ....................................................................................... 76 
II. Contribution to Research ................................................... 78 
Chapter 4 Modelling and Evaluation of Cache Coherence 
Mechanisms for Multicore Processors ........................... 79 
4.1. Introduction .................................................................................... 80 
4.2. Background and Related Work ...................................................... 83 
4.2.1. Snooping Protocols ........................................................................ 83 
4.2.2. Directory-based Protocols .............................................................. 86 
4.2.3. Token-based Protocols .................................................................... 89 
4.3. Evaluation Methodology ................................................................ 90 
4.3.1. System Setup and Validation ............................................................ 90 
4.3.2. Benchmarks ................................................................................... 92 
4.4. Analysis and Evaluations .............................................................. 92 
4.5. Conclusions ..................................................................................... 95 
Chapter 5 AHRC: An Optimised Cache Associativity ........ 97 
5.1. Overview ......................................................................................... 98 
5.2. Introduction .................................................................................... 99 
5.3. Background and Impact ............................................................... 101 
5.3.1. Hash Block Address:..................................................................... 102 
5.3.2. Skew-associative Caches: .............................................................. 102 
5.3.3. ZCache: ...................................................................................... 103 
5.3.4. Cuckoo Directories: ..................................................................... 103 
5.3.5. RRIP: .......................................................................................... 103 
  
 
5.3.6. Allow Multiple Locations per Way: ................................................ 104 
5.3.7. Use a Victim Cache: ..................................................................... 105 
5.3.8. Use Indirection in the Tag Array: ................................................... 105 
5.4. Design of AHRC and ARIP .......................................................... 106 
5.4.1. AHRC ........................................................................................... 106 
5.4.2. ARIP ........................................................................................... 107 
5.5. Methodology .................................................................................. 109 
5.5.1. Infrastructure ............................................................................... 109 
5.5.2. Workloads ................................................................................... 110 
5.5.3. Replacement Policy ...................................................................... 112 
5.5.4. Energy and Power Models ............................................................ 112 
5.6. Results, Analysis and Evaluations .............................................. 113 
5.7. Conclusion ..................................................................................... 117 
Chapter 6 PROI: Block-based Coherence Bypass Protocol .... 119 
6.1. Chapter Summary ........................................................................ 120 
6.2. Introduction .................................................................................. 121 
6.3. Related Work ................................................................................ 125 
6.4. Background and Motivation ........................................................ 130 
6.4.1. Directory Caches ......................................................................... 130 
6.4.2. Fast Address Translation .............................................................. 131 
6.4.3. OS and TLBs Private/Shared Data Classification ............................ 132 
6.4.4. Hardware Private/Shared Classification Approaches ....................... 133 
6.5. PROI Approach ............................................................................. 135 
6.6. Experimental Setup ..................................................................... 141 
6.6.1. System Setup ................................................................................ 141 
6.6.2. Benchmarks ................................................................................. 142 
6.6.3. Evaluation Metrics ....................................................................... 144 
6.7. Performance Evaluation .............................................................. 144 
 
6.7.1. Private Blocks classification .......................................................... 144 
6.7.2. Private Cache Misses.................................................................... 145 
6.7.3. Impact on Network Traffic ............................................................. 146 
6.7.4. Execution Time ............................................................................ 147 
6.7.5. Impact on Energy Consumption ..................................................... 148 
6.8. Conclusions ................................................................................... 150 
Chapter 7 Conclusions ........................................................... 151 
7.1. Summary ....................................................................................... 152 
7.2. Thesis contribution ....................................................................... 153 
7.3. Discussion ..................................................................................... 154 
7.4. Future work .................................................................................. 156 
7.5. Final Remarks .............................................................................. 157 
III. Bibliography and publications ...................................... 159 





 
List of Figures 
 
Figure 1.1 Moore’s law for some integrated circuits intensities (Roser 2016) ................. 3 
Figure 1.2 Multi-core performance compared to a single core (Rao 2009) ...................... 4 
Figure 1.3 General Cache Hierarchy Levels ..................................................................... 7 
Figure 1.4 Processors improvement (Batten 2016) ......................................................... 12 
Figure 1.5 Research Hypothesis ...................................................................................... 13 
Figure 2.1 Literature review high-level mind map ......................................................... 23 
Figure 2.2 Area, hit latency and hit energy factors of an 8MB set-associative cache array 
with 1 to 32 ways ............................................................................................................ 24 
Figure 3.1 Cycles Per Instruction stack .......................................................................... 69 
Figure 3.2. Power and area stacks ................................................................................... 70 
Figure 3.3 Cycle stacks plotted over time ....................................................................... 70 
Figure 4.1 Cache Coherence Problem ............................................................................. 81 
Figure 4.2 Runtime of TOKENB, SNOOPING, and DIRECTORY using unbounded link 
bandwidth ........................................................................................................................ 94 
Figure 4.3 The endpoint Traffic of DIRECTORY, SNOOPING, and TOKENB in 
normalised messages per miss ........................................................................................ 94 
 
Figure 4.4 Normalised Interconnect Traffic of DIRECTORY, SNOOPING, and 
TOKENB ........................................................................................................................ 95 
Figure 5.1 MPKI improvements over 4-way set associative caches for 16-way set 
associative, Fully Associative, and AHRC ................................................................... 114 
Figure 5.2 LLC power increase over 4-way set associative design .............................. 115 
Figure 5.3 Full system energy reduction over 4-way set associative design assuming 2W 
per core .......................................................................................................................... 116 
Figure 5.4 Comparisons of MPKI improvements over 4-way (8MB) set associative caches 
for 16-way set associative using LRU and AHRC using ARIP .................................... 117 
Figure 6.1: State Machine Diagram for the PROI protocol .......................................... 138 
Figure 6.2: Percentage of detected private data with page and block granularities ...... 145 
Figure 6.3: Cache miss rate normalised with respect to MESI ..................................... 146 
Figure 6.4: Network traffic normalised with respect to MESI...................................... 147 
Figure 6.5: Execution time normalised with respect to MESI ...................................... 149 
Figure 6.6: Energy consumption normalised with respect to MESI ............................. 149 
  
  
 
List of Tables 
 
Table 1.1 Thesis Part I and part II Structure ................................................................... 17 
Table 2.1 Cache coherence schemes characteristics ...................................................... 48 
Table 3.1 Summary of Existing Simulators Categorised by Features ............................ 63 
Table 3.2 System Parameters .......................................................................................... 75 
Table 4.1 Simulation Parameters .................................................................................... 92 
Table 5.1 Main characteristics of the simulated CMPs................................................. 110 
Table 5.2 Benchmark Configurations ........................................................................... 111 
Table 6.1. Protocol states and classification ................................................................. 137 
Table 6.2. System Parameters ....................................................................................... 142 
Table 6.3. Workloads and Input Specifications ............................................................ 143 
  
 
List of Abbreviations 
 
AHRC Adaptive Hashing and Replacement Cache 
ARIP Adaptive Reuse Interval Prediction 
BIP Bimodal Insertion Policy 
BSD Berkeley-Style Open Source 
CC-NUMA Cache Coherent Non-Uniform Memory Architecture 
CMP Chip Multi-Processor 
CPI Cycles Per Instruction 
CPU Central Processing Unit 
CSA Computer System Architecture 
DIP Dynamic Insertion Policy 
DMA Direct Memory Access 
DPIIP Dynamic Promotion with Interpolated Increments Policy 
DRAM Dynamic Random Access Memory 
DRF Data-Race-Free 
DSM Distributed Shared Memory 
DSP Digital Signal Processor 
DVFS Dynamic Voltage and Frequency Scaling 
ECC Error Correction Codes 
EEN Explicit Eviction Notification 
FIFO First in First Out 
FS/A Full-System / Application-Level 
  
 
ICEmon In-Cache Estimation Monitor 
IIC Indirect Index Cache 
IPC Instructions Per Cycle 
IPSEL Insertion Policy Selection 
ISA Instruction Set Architecture 
KIPS Kilo Instructions Per Second 
L1 Cache Level 1  
L2 Cache Level 2 
L3 Cache Level 3 
LIFO Last in First Out 
LIP LRU Insertion Policy 
LLC Last Level Cache 
LoD Level of Detail 
LRU Least Recently Used 
MLP Memory Level Parallelism 
MPKI Misses Per 1000 Instructions 
MRU Most Recently Used 
NACK Negative Acknowledgment 
NRU Not Recently Used 
NUCA Non-Uniform Cache Access 
NUMA Non-Uniform Memory Architecture 
OPT Optimal Replacement Policy 
OS Operating System 
PC Personal Computer 
PIPP Promotion/Insertion Pseudo-Partitioning 
PPSEL Promotion Policy Selection 
 
PROI Private/Shared, Read-Only/Read-Write, Invalid/Valid 
PSEL Policy Selection 
PTE Page Table Entry 
QoS Quality-of-Service 
RO Read Only 
ROI Region of Interest 
RRIP Re-Reference Interval Prediction 
RRPV Re-Reference Prediction Value 
RSWEL Reconstituted SWEL 
RW Read Write 
SDM Set Dueling Monitor 
SIPP Single-step Incremental Promotion Policy 
SMP Symmetric Multi-Processor 
SMT Simultaneous Multi-Threading 
SRAM Static Random Access Memory 
SWEL Protocol states: Shared, Written, Exclusivity Level 
SWMR Single-Writer/Multiple-Readers 
TADIP Thread-Aware Dynamic Insertion Policy 
TLB Translation Lookaside Buffer 
TMA Trap-based Memory Architecture 
UCP Utility-based Cache Partitioning 
UIUC/NCSA University of Illinois/NCSA Open Source License 
VIPS Valid/Invalid Private/Shared 
W Watt 
  
