Drowsy Cache Partitioning for Multithreaded Systems and High Level Caches by Kenyon, Samantha Rose
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
12-2014
Recommended Citation
Kenyon, Samantha Rose, "Drowsy Cache Partitioning for Multithreaded Systems and High Level Caches" (2014). Thesis. Rochester
Institute of Technology. Accessed from
Drowsy Cache Partitioning for Multithreaded Systems and
High Level Caches
by
Samantha Rose Kenyon
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of
Science
in Computer Engineering
Supervised by
Assistant Professor Dr. Sonia Lopez Alarcon
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
December 2014
Approved by:
Dr. Sonia Lopez Alarcon, Assistant Professor
Thesis Advisor, Department of Computer Engineering
Dr. Amlan Ganguly, Assistant Professor
Committee Member, Department of Computer Engineering
Dr. Muhammed Shaaban, Associate Professor
Committee Member, Department of Computer Engineering
Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering
Title:
Drowsy Cache Partitioning for Multithreaded Systems and High Level Caches
I, Samantha Rose Kenyon, hereby grant permission to the Wallace Memorial Library to
reproduce my thesis in whole or part.
Samantha Rose Kenyon
Date
Dedication
I dedicate this thesis to my parents, Judy and Donald Kenyon, and my entire family who
have always supported me and have done all they can to help me reach my goals.
Acknowledgments
I would like to thank my adviser, Dr. Sonia Lopez Alarcon. Without her guidance and
support this work would not have been possible. I would also like to thank my committee
members, Dr. Amlan Ganguly and Dr. Muhammed Shaaban. In addition, I would like
to thank my friends, Jason Lowden and Colin Donahue, for supporting me on this journey.
They have always been there for me through many ups and downs and I would not have
made it this far without them. Finally, I would like to thank the Computer Engineering
department. The department as a whole has provided me unending support and guidance
that has been invaluable. My degree would not have been possible without the amazing
faculty and staff in this department who have guided me through the years.
Abstract
Drowsy Cache Partitioning for Multithreaded Systems and High Level Caches
Samantha Rose Kenyon
Supervising Professor: Dr. Sonia Lopez Alarcon
Power consumption is becoming an increasingly important component of processor design.
As technology shrinks, both static and dynamic power become more relevant. This is
particularly important for the cache hierarchy. The cache portion of a microprocessor contains
a large percentage of the total number of transistors in the microprocessor. Therefore, the
cache consumes a large percentage of both static and dynamic power. Past efforts to reduce
power consumption have always involved a large trade-off between energy
savings and performance.
Techniques that reduce power consumption typically have a negative impact on
performance. Likewise, improved performance usually comes at the cost of higher energy
consumption. In addition, many current implementations reduce only one kind of power in the cache,
either static or dynamic. For a more robust approach that will remain relevant as technology
continues to shrink, both aspects of power need to be addressed.
This thesis implements a phase adaptive cache that reduces both static and dynamic
power while having very little impact on performance. This cache stores the most
recently used blocks in one partition that is quick and easy to access. The second partition
is placed in drowsy mode to reduce leakage power consumption. In this work, this approach
is implemented for all three levels of cache in a multicore architecture. The design is also
tested with multithreaded simulations. Brendan Fitzgerald et al. used a similar approach in
[10]; however, it was applied only to a second level cache running a single threaded application.
The results are measured using an architecture simulator. Simulations of the modified
cache structure are compared to those of a baseline, unmodified cache hierarchy running
on the same machine. The comparison covers energy savings, both static
and dynamic, and the overall impact on performance. The results in this work
show that this cache design produces both dynamic energy and leakage energy savings with
a low performance impact.
Contents
Dedication
Acknowledgments
Abstract
1 Introduction
2 Supporting Work
2.1 Previous Work
2.1.1 Selective Cache Ways
2.1.2 Cache Hierarchy Reconfiguration
2.1.3 Accounting Cache
2.1.4 Phase Adaptive Cache
2.1.5 Drowsy Cache
2.1.6 Temporal Locality for Drowsy Caches
2.2 Related Work
2.2.1 MorphCache
2.2.2 Efficient Cache Resizing
2.2.3 DRG-Cache
2.2.4 Location Cache
2.2.5 Application Specific Low Leakage Cache
2.2.6 Multi-Core Analysis
2.3 Summary
3 Drowsy Phase Adaptive Cache for L1, L2, and L3 On-Chip Cache
3.1 Drowsy Phase Adaptive Cache
3.1.1 Cost Functions
3.2 Summary
4 Methodology
4.1 SPICE
4.2 CACTI
4.2.1 Latency
4.2.2 Energy and Power
4.3 Multi2sim
4.3.1 Simulation Configurations
4.3.2 Benchmarks
4.4 Summary
5 Results
5.1 Experiments
5.2 Single Threaded Results
5.2.1 L1 Results
5.2.1.1 Performance
5.2.1.2 Energy and Power
5.2.2 L2 Results
5.2.2.1 Performance
5.2.2.2 Energy and Power
5.2.3 L3 Results
5.2.3.1 Performance
5.2.3.2 Energy and Power
5.2.4 L1, L2, L3 Single Threaded
5.2.4.1 Performance
5.2.4.2 Energy and Power
5.3 Multithreaded Results
5.3.1 L1 and L2 Multithreaded Results
5.3.1.1 Performance
5.3.1.2 Energy and Power
5.3.2 L1 and L3 Multithreaded Results
5.3.2.1 Performance
5.3.2.2 Energy and Power
5.3.3 L2 and L3 Multithreaded Results
5.3.3.1 Performance
5.3.3.2 Energy and Power
5.3.4 L1, L2, L3 Multithreaded Results
5.3.4.1 Performance
5.3.4.2 Energy and Power
5.4 Multicore Results
5.4.0.3 Performance
5.4.0.4 Energy and Power
5.5 Summary
6 Conclusions
Bibliography
A Multithreaded Results
A.1 L1 Multithreaded Results
A.1.1 Performance
A.1.1.1 Configuration Time
A.1.1.2 Speedup and IPC
A.1.2 Energy and Power
A.1.2.1 Dynamic Energy
A.1.2.2 Leakage Energy
A.1.2.3 Total Energy
A.2 L2 Multithreaded Results
A.2.1 Performance
A.2.1.1 Configuration Time
A.2.1.2 Speedup and IPC
A.2.2 Energy and Power
A.2.2.1 Dynamic Energy
A.2.2.2 Leakage Energy
A.2.2.3 Total Energy
A.3 L3 Multithreaded Results
A.3.1 Performance
A.3.1.1 Configuration Time
A.3.1.2 Speedup and IPC
A.3.2 Energy and Power
A.3.2.1 Dynamic Energy
A.3.2.2 Leakage Energy
A.3.2.3 Total Energy
List of Tables
3.1 Possible configurations of 4 way cache hierarchy
3.2 Possible configurations of 8 way cache hierarchy
3.3 Possible configurations of 16 way cache hierarchy
4.1 CACTI latency results for L1
4.2 Possible configurations and latency of L1
4.3 CACTI latency results for L2
4.4 Possible configurations and Latency for L2
4.5 CACTI latency results for L3
4.6 Possible configurations and Latency for L3
4.7 CACTI L1 Energy Results
4.8 CACTI L1 Final Energy Results
4.9 CACTI L2 Energy Results
4.10 CACTI L3 Energy Results
4.11 CACTI L3 Final Energy Results
4.12 Memory Configurations for All Simulations
4.13 Processor Configuration
4.14 Spec2006 Benchmarks Used [6]
4.15 Spec2006 Benchmark Performance Numbers
5.1 Simulation Configurations and Descriptions
5.2 Access Costs for L3 Configurations
5.3 Experiments with Four Threads
5.4 Multicore Experiments
List of Figures
2.1 Hardware Design for a Single Way in a Selective Way Cache [3]
2.2 Example Partitioning for 256KB Banks [5]
2.3 Possible Cache Configurations for a single 256KB structure [5]
2.4 Possible configurations of a 4-way cache and swapping blocks [7]
2.5 Different Clock Domains within Phase Adaptive Cache Design [8]
2.6 Drowsy memory circuit diagram [11]
2.7 Drowsy Cache Line Implementation Logic [11]
2.8 MorphCache Example Four Core Topology [18]
2.9 Resizable Cache Architecture [13]
2.10 DRG-Cache Gated-Ground Transistor [1]
2.11 Location Cache Example Hierarchy [15]
3.1 Example of cache partitioning [10]
3.2 Example access pattern with MRU counters and states [10]
4.1 SRAM Cell used for SPICE simulations [10]
4.2 SPICE Simulation Results
4.3 Plot and equation for L1 Latency
4.4 Plot and equation for L2 Latency
4.5 Plot and equation for L3 Latency
4.6 L1 Energy Plot and Equations
4.7 L2 Energy Plot and Equations
4.8 L3 Energy Plot and Equations
4.9 Multiple Core Configuration [2]
5.1 L1 Config Distributions
5.2 Speedup of L1 Simulations
5.3 L1 MRU Distributions
5.4 Total Dynamic Energy Savings of L1 Simulations
5.5 Total Leakage Energy of L1 Simulations
5.6 Total Energy Savings of L1 Simulations
5.7 L2 Config Distributions
5.8 Speedup of L2 Simulations
5.9 L2 MRU Distributions
5.10 Total Dynamic Energy Savings of L2 Simulations
5.11 Total Leakage Energy of L2 Simulations
5.12 Total Energy Savings of L2 Simulations
5.13 L3 Config Distributions
5.14 Speedup of L3 Simulations
5.15 L3 MRU Distributions
5.16 Total Dynamic Energy Savings of L3 Simulations
5.17 Total Leakage Energy of L3 Simulations
5.18 Total Energy Savings of L3 Simulations
5.19 L1 Config Distributions
5.20 L2 Config Distributions
5.21 L3 Config Distributions
5.22 Speedup
5.23 Total Dynamic Energy Savings of L1 Cache
5.24 Total Dynamic Energy Savings of L2 Cache
5.25 Total Dynamic Energy Savings of L3 Cache
5.26 Total Leakage Energy of L1 Cache
5.27 Total Leakage Energy of L2 Cache
5.28 Total Leakage Energy of L3 Cache
5.29 Total Energy Savings of L1 Cache
5.30 Total Energy Savings of L2 Cache
5.31 Total Energy Savings of L3 Cache
5.32 L1 Config Distributions
5.33 L2 Config Distributions
5.34 Speedup
5.35 Total Dynamic Energy Savings of L1 Cache
5.36 Total Dynamic Energy Savings of L2 Cache
5.37 Total Leakage Energy of L1 Cache
5.38 Total Leakage Energy of L2 Cache
5.39 Total Energy Savings of L1 Cache
5.40 Total Energy Savings of L2 Cache
5.41 L1 Config Distributions
5.42 L3 Config Distributions
5.43 Speedup
5.44 Total Dynamic Energy Savings of L1 Cache
5.45 Total Dynamic Energy Savings of L3 Cache
5.46 Total Leakage Energy of L1 Cache
5.47 Total Leakage Energy of L3 Cache
5.48 Total Energy Savings of L1 Cache
5.49 Total Energy Savings of L3 Cache
5.50 L2 Config Distributions
5.51 L3 Config Distributions
5.52 Speedup
5.53 Total Dynamic Energy Savings of L2 Cache
5.54 Total Dynamic Energy Savings of L3 Cache
5.55 Total Leakage Energy of L2 Cache
5.56 Total Leakage Energy of L3 Cache
5.57 Total Energy Savings of L2 Cache
5.58 Total Energy Savings of L3 Cache
5.59 L1 Config Distributions
5.60 L2 Config Distributions
5.61 L3 Config Distributions
5.62 Speedup
5.63 Total Dynamic Energy Savings of L1 Cache
5.64 Total Dynamic Energy Savings of L2 Cache
5.65 Total Dynamic Energy Savings of L3 Cache
5.66 Total Leakage Energy of L1 Cache
5.67 Total Leakage Energy of L2 Cache
5.68 Total Leakage Energy of L3 Cache
5.69 Total Energy Savings of L1 Cache
5.70 Total Energy Savings of L2 Cache
5.71 Total Energy Savings of L3 Cache
5.72 L1-1 Config Distributions
5.73 L1-2 Config Distributions
5.74 L2-1 Config Distributions
5.75 L2-2 Config Distributions
5.76 L3 Config Distributions
5.77 Speedup
5.78 Total Dynamic Energy Savings of L1-1 Cache
5.79 Total Dynamic Energy Savings of L1-2 Cache
5.80 Total Dynamic Energy Savings of L2-1 Cache
5.81 Total Dynamic Energy Savings of L2-2 Cache
5.82 Total Dynamic Energy Savings of L3 Cache
5.83 Total Leakage Energy of L1-1 Cache
5.84 Total Leakage Energy of L1-2 Cache
5.85 Total Leakage Energy of L2-1 Cache
5.86 Total Leakage Energy of L2-2 Cache
5.87 Total Leakage Energy of L3 Cache
5.88 Total Energy Savings of L1-1 Cache
5.89 Total Energy Savings of L1-2 Cache
5.90 Total Energy Savings of L2-1 Cache
5.91 Total Energy Savings of L2-2 Cache
5.92 Total Energy Savings of L3 Cache
A.1 L1 Config Distributions
A.2 Speedup for L1 Simulations
A.3 L1 MRU Distributions
A.4 Total Dynamic Energy Savings of L1 Simulations
A.5 Total Leakage Energy of L1 Simulations
A.6 Total Energy Savings of L1 Simulations
A.7 L2 Config Distributions
A.8 Speedup of L2 Simulations
A.9 L2 MRU Distributions
A.10 Total Dynamic Energy Savings of L2 Simulations
A.11 Total Leakage Energy of L2 Simulations
A.12 Total Energy Savings of L2 Simulations
A.13 L3 Config Distributions
A.14 Speedup of L3 Simulations
A.15 L3 MRU Distributions
A.16 Total Dynamic Energy Savings of L3 Simulations
A.17 Total Leakage Energy of L3 Simulations
A.18 Total Energy Savings of L3 Simulations
Chapter 1
Introduction
As manufacturing technology is improved and transistors shrink, the number of transistors
on a chip has increased exponentially. Power consumption has now become a major factor
in microprocessor design. Performance improvements in current designs are limited by
their power consumption and therefore improvements in speed have diminished over the
years.
Memory elements in a microprocessor do not improve as fast as processors. Due to this
bottleneck, a large percentage of transistors are used to increase the size of local storage,
the cache hierarchy. This directly results in the cache consuming a large percentage of
overall power in the microprocessor. The cache hierarchy contains up to 35% of the total
number of transistors on a processor [10], making power consumption within the cache
critical.
Previous efforts have focused on either static power or dynamic power consumption,
applying techniques that modify the architecture, the circuit design, or the actual transistor
manufacturing. Previous efforts have also focused on either increasing performance or
reducing energy because of the inherent trade-off between the two. An increase in
performance typically corresponds to an increase in energy consumption. Likewise, decreasing power
consumption typically corresponds to a decrease in performance. This work uses a
combination of techniques to reduce both static and dynamic power while only slightly reducing
performance. By combining transistor level circuit analysis and cache architecture,
both static and dynamic power can be saved with an acceptable performance hit.
This is done using the drowsy cache partitioning scheme proposed previously and
implemented in the second level cache [10]. This work exploits the temporal locality of cache
memory in which 92% of all cache accesses are made to the most recently used (MRU)
cache line. Extending that idea shows that 98% of accesses are made to the two most
recently used cache lines [16]. This information can then be used to design a cache hierarchy
with two partitions. One partition, the A partition, remains in an active state, while the
other partition, the B partition, is in a drowsy state. The partitions are also phase adaptive,
so the ideal partition sizes can be determined using cost functions. There are
multiple cost functions, based on either the energy consumption or the energy-delay product.
Although each partition can dynamically alter its associativity, the total set associativity of
the cache remains unchanged.
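The role of the two kinds of cost function can be illustrated with a minimal Python sketch. It is not the cost function used in this thesis. The function names, the configuration labels, and the energy and delay values are all hypothetical and serve only to show how an energy-only metric and an energy-delay product can prefer different partitionings of the same 4-way cache.

```python
def energy_cost(energy, delay):
    """Cost that considers energy only (delay is ignored)."""
    return energy


def energy_delay_cost(energy, delay):
    """Energy-delay product: penalizes configurations that save energy but run slowly."""
    return energy * delay


def best_config(candidates, cost):
    """Return the label of the candidate with the lowest cost.

    candidates maps a configuration label to an (energy, delay) pair.
    """
    return min(candidates, key=lambda name: cost(*candidates[name]))


# Hypothetical, normalized (energy, delay) values for three ways of splitting
# a 4-way cache between the active A partition and the drowsy B partition.
candidates = {
    "A=1,B=3": (1.0, 1.30),
    "A=2,B=2": (1.2, 1.05),
    "A=4,B=0": (2.0, 1.00),
}

print(best_config(candidates, energy_cost))
print(best_config(candidates, energy_delay_cost))
```

With these illustrative numbers the energy-only cost favors the smallest A partition, while the energy-delay product favors the 2/2 split because the delay penalty of the smallest A partition outweighs its energy advantage.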
This cache design saves dynamic power by accessing the A partition first, which is only
a portion of the total cache, and then accessing the B partition only when the data needed
is not found in the A partition. The most recently used blocks are kept in A by swapping
data between partitions. When the data is not found in A but is found in B, that data is then
placed in the A partition and swapped with the least recently used block found in A. The
same is done for data found in other levels of cache.
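The access and swap behavior described above can be sketched for a single cache set as follows. This is a simplified model under stated assumptions, not the implementation evaluated in this thesis: the class name PartitionedSet is hypothetical, tags stand in for whole blocks, and the choice to evict the least recently used block of B when the set is full is an assumption made only to keep the sketch complete.

```python
class PartitionedSet:
    """One cache set split into an active A partition and a drowsy B partition."""

    def __init__(self, a_ways, b_ways):
        self.a_ways = a_ways
        self.b_ways = b_ways
        self.a = []  # tags held in A, most recently used first
        self.b = []  # tags held in B, most recently used first

    def access(self, tag):
        """Look up a tag and return "A", "B", or "miss" for where it was found."""
        # A is always probed first; a hit here never touches B.
        if tag in self.a:
            self.a.remove(tag)
            self.a.insert(0, tag)
            return "A"

        # B is probed only after A misses.
        where = "miss"
        if tag in self.b:
            self.b.remove(tag)
            where = "B"

        # The block (from B or from another cache level) is placed in A.
        self.a.insert(0, tag)
        if len(self.a) > self.a_ways:
            # The least recently used block of A moves to B.
            victim = self.a.pop()
            self.b.insert(0, victim)
            if len(self.b) > self.b_ways:
                # Assumption: the least recently used block of B leaves the set.
                self.b.pop()
        return where


s = PartitionedSet(a_ways=2, b_ways=2)
print([s.access(t) for t in (1, 2, 3, 1, 3)])
```

In this trace the fourth access finds block 1 in B and swaps it with block 2, the least recently used block of A, so the two most recently used blocks are again served from the active partition.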
This design also saves static power because the B partition is placed in a drowsy state. This
means that the B partition is kept at a lower voltage. This voltage is low enough
to reduce power consumption but high enough that data is not lost and does not need to be written back. This
reduces the overall static power consumption of the cache hierarchy at the cost of only a modest
decrease in overall performance.
In this work, this cache design is applied to all three levels of the cache in both
multithreaded and multicore architectures. Due to the differences in accesses, size, and behavior,
it is useful to look at the effect this design will have on both energy and performance for
all three levels.
The rest of this thesis is organized as follows: Chapter 2 discusses supporting work as well
as previous designs that have been proposed. Chapter 3 outlines the proposed design for
the drowsy phase adaptive cache. Chapter 4 explains the implementation of this design as
well as the testing and simulation environment. Chapter 5 shows results for single threaded,
multithreaded, and multicore simulations. Finally, Chapter 6 provides conclusions that can
be drawn from these results.
Chapter 2
Supporting Work
There have been many advancements in cache architecture design through the years.
These advancements have allowed for new cache designs and implementations, including what
is proposed here. Many of these works laid the foundation for low-power cache design,
while others focus on high-performance design. The ideas presented here are extended and
applied in ways similar to the work presented in this thesis and have been instrumental in
this cache design.
2.1 Previous Work
2.1.1 Selective Cache Ways
The idea of selective cache ways [3], first presented by Albonesi, was the first step in the
development of phase adaptive cache hierarchies. In this hierarchy, a controller is used to
turn on and off certain cache ways. The number of ways being used at any given time is
dependent upon the number of instructions per cycle (IPC). As the IPC decreases, the
number of ways being used increases to improve the performance. This is implemented with
hardware and software modifications. Figure 2.1 shows the hardware modifications that
need to be made to a basic cache. There are no modifications made to the tag portion of the
hardware. The data portion represents roughly 90% of the entire cache power dissipation,
so this implementation focuses on energy savings within the data portion only [3].
It can be seen in Figure 2.1 that the data is divided into four separate way elements. A
Cache Way Select Register (CWSR) contains one bit for each of those ways to indicate
which ways to enable. This information is then sent to the cache controller which enables
or disables the appropriate ways. When a way is disabled no data is selected from that way
and therefore that way essentially dissipates no dynamic power.
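The CWSR can be modeled as a small bit mask. The sketch below is illustrative only: the helper names are hypothetical, and the mapping of bit w to way w is an assumption, since the text states only that the register holds one bit per way.

```python
NUM_WAYS = 4  # Figure 2.1 shows the data array divided into four ways


def enabled_ways(cwsr):
    """Return the indices of the data ways whose CWSR bit is set."""
    return [w for w in range(NUM_WAYS) if (cwsr >> w) & 1]


def enable_way(cwsr, way):
    """Set the CWSR bit for one way (assumed mapping: bit w controls way w)."""
    return cwsr | (1 << way)


def disable_way(cwsr, way):
    """Clear the CWSR bit for one way, keeping the value within NUM_WAYS bits."""
    return cwsr & ~(1 << way) & ((1 << NUM_WAYS) - 1)


cwsr = 0b1111                  # all four ways enabled
cwsr = disable_way(cwsr, 3)    # way 3 no longer selected on data accesses
print(enabled_ways(cwsr))
```

Only the ways reported by enabled_ways would be selected on a data access, which is what removes the dynamic power of the disabled ways.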
Figure 2.1: Hardware Design for a Single Way in a Selective Way Cache [3]
A Performance Degradation Threshold (PDT) is used here to determine when to enable and
disable certain ways. This is set to either 2%, 4%, or 6%. When the IPC is projected to
fall below this threshold with the current system, an additional way is then enabled. This
allows for the performance of the system to be configured using different threshold values
to obtain a desired performance.
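The threshold decision can be sketched as a single comparison. This is one possible reading and not the exact mechanism of [3]: the function name, the use of an all-ways-enabled baseline IPC, and the interpretation of the PDT as an allowed fractional drop from that baseline are assumptions made for illustration.

```python
def next_way_count(ways_enabled, total_ways, baseline_ipc, projected_ipc, pdt):
    """Return the number of ways to enable for the next interval.

    pdt is the Performance Degradation Threshold as a fraction (0.02, 0.04, or 0.06).
    baseline_ipc is assumed to be the IPC with every way enabled.
    """
    # Lowest IPC that is still acceptable under the chosen threshold.
    floor = baseline_ipc * (1.0 - pdt)
    if projected_ipc < floor and ways_enabled < total_ways:
        return ways_enabled + 1  # enable one additional way
    return ways_enabled


# With a 4% threshold and a baseline IPC of 2.0, the acceptable floor is 1.92.
print(next_way_count(2, 4, baseline_ipc=2.0, projected_ipc=1.90, pdt=0.04))
print(next_way_count(2, 4, baseline_ipc=2.0, projected_ipc=1.95, pdt=0.04))
```

A larger threshold tolerates a lower IPC before another way is enabled, trading performance for energy savings.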
This implementation was very successful and reduced dynamic cache energy dissipation
by 40% while only incurring a 2% performance hit. However, it reduces only dynamic power,
which is not sufficient as static power becomes more important.
2.1.2 Cache Hierarchy Reconfiguration
The implementation in [3] is inherently limiting because it only allows for enabling or
disabling a single way as the IPC changes. This does not take into account situations in
which an application would transition from very low cache activity to very high cache
activity [5]. This can occur often within a given application, especially when thrashing
occurs. Balasubramonian et al. [5] attempt to enhance Albonesi’s design in [3] by
removing this limitation.
In [5] Balasubramonian et al. implement a cache that can be dynamically reconfigured by
implementing a one level cache at the physical layer and a two level virtual cache structure.
This is done by taking a 2MB cache and splitting it into two 1MB banks. These banks
are then broken down even further into 256KB banks. This design for the 256KB banks is
shown in Figure 2.2.
Figure 2.3 shows the possible ways to partition a cache structure with one
physical level and two virtual levels for a single 256KB structure. Each block in Figure 2.3
represents a 128KB cache bank. Each of these banks has a top line that behaves like a direct
mapped level 1 cache. The bottom line then behaves like a two-way set associative level 1
cache. This structure replaces the traditional cache structure. It also allows for reduction
in energy dissipation while maintaining the same overall cache size with just a reduction in
the number of ways.
Figure 2.2: Example Partitioning for 256KB Banks [5]
The access protocol for this cache set up varies from [3]. When there is a hit in L1 a single
way is being accessed and the data is returned. When there is a miss in L1 all of the tag
arrays of L2 are now read in parallel to increase performance. If the data is found in L2
then the block is swapped with the block in the specific way of L1 and the location where
the data was found in L2. If the data is not found in L2 the data currently in the specified
way of L1 is moved to L2 and new data is moved into L1. This guarantees that the cache is
non-inclusive.
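The swap behavior described above can be sketched for a single set as follows. This is a simplified model and not the design from [5]: the L1 side is reduced to one slot, tags are plain strings, and the L2 victim policy on a miss in both levels (drop the last entry) is an assumption made for the example.

```python
def access(l1, l2, tag):
    """Access one set. l1 is a one-element list (the selected L1 way) and
    l2 is the list of L2 ways; both hold tags, or None when empty."""
    if l1[0] == tag:
        return "L1 hit"
    victim = l1[0]
    if tag in l2:
        # Swap: the L1 victim takes the slot where the block was found in L2.
        l2[l2.index(tag)] = victim
        l1[0] = tag
        return "L2 hit"
    # Miss in both levels: the L1 victim moves to L2 and the new block
    # fills L1. Dropping the last L2 entry is an assumed policy.
    if victim is not None:
        l2.pop()
        l2.insert(0, victim)
    l1[0] = tag
    return "miss"
```

Because a block is only ever moved between the two levels and never copied, no tag appears in both `l1` and `l2` at the same time, which is the property the protocol is meant to guarantee.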
The cache is first initialized to the smallest possible size which is a direct mapped 256KB
L1. The initial state is set to be unstable requiring the cache performance be compared
Figure 2.3: Possible Cache Configurations for a single 256KB structure [5]
to the accepted tolerance levels. The tolerance rates of this design are based on the
IPC as well as the miss rate and the branch frequency. When the cache does not reach
the accepted tolerance rates the cache size is increased to the next largest size. This will
continue until the cache reaches the maximum size or all of the current working sets fit
within the configuration. The configuration is then set to stable after the number of misses
and branches remains stable. If the configuration then changes out of the stable state and is
set to be unstable then the cache is again set to the smallest configuration. Figure 2.3 shows
a detailed view of this process for a L1/L2 setup.
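The stable/unstable control flow described above can be summarized as a small state machine. This is a simplified sketch and not the controller from [5]: the two boolean inputs stand in for the IPC, miss rate, and branch frequency checks, and treating the configuration as stable as soon as the tolerance is met is an assumption made for the example.

```python
def step(config, stable, meets_tolerance, behaviour_changed, max_config):
    """Return the (config, stable) pair for the next interval.

    config is an index where 0 is the smallest cache and max_config
    is the largest.
    """
    if stable:
        if behaviour_changed:
            return 0, False          # restart from the smallest configuration
        return config, True
    if not meets_tolerance and config < max_config:
        return config + 1, False     # grow to the next largest size
    return config, True              # tolerance met or maximum size reached
```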
This implementation reduced the Cycles Per Instruction (CPI) by 15% when setup for
L1/L2 design. It also was able to reduce energy dissipation by 43% when setup for L2/L3
design.
2.1.3 Accounting Cache
The accounting cache [7] represents another contribution to low power cache design. This
design is intended to leverage locality and LRU data rather than IPC to dynamically change
the cache structure, reducing dynamic power. This implementation requires MRU counters
for each of the number of ways in the cache. These counters keep track of the number of
hits at each of these ways. This implementation keeps track of the energy and performance
cost of the cache based on the current configuration. Performance Degradation Threshold
(PDT) is also used here to determine when to change cache configurations. The values
used for this system are 1.5%, 6.2%, and 25%.
The overall system is divided into two partitions; the A partition and the B partition. The
number of ways in each partition determines the current configuration. Figure 2.4 shows
the possible configurations for a 4-way cache of this type and also swapping of the cache
blocks.
Figure 2.4: Possible configurations of a 4-way cache and swapping blocks [7]
The access protocol is modified here to account for accessing both partitions. First the
A partition is accessed. If the block is found in A, the corresponding MRU counter is
incremented and the data is returned. If the data is not found in the A partition the B
partition is then searched along with the next level in the cache hierarchy. If the data
is found in the B partition, the corresponding MRU counter is updated and the block is
returned. The block is also swapped into the A partition at this time, with the LRU block
in the A partition. If the data is not found in B, but is found in the next level of the cache
hierarchy, then this block displaces the LRU block of the A partition. This displaced block
then replaces the LRU block of the B partition. The displaced block from the B partition is
then written to the next level. This design ensures that the A partition always contains the
most recently used blocks.
This LRU based design provides additional information if the data is expressed using its
dual, the most recently used (MRU) ordering. The most recently used way is referred to
as MRU 0 and the next most recently used way is referred to as MRU 1 and so on. This
means that hits to either MRU 0 or MRU 1 would represent a hit in a 2-way set associative
cache. This idea can then be extended to determine the hit ratio for any partitioning of the
cache system. The hits that occur in each MRU state provide the hit information for each
partition and can therefore be used to determine the next optimal configuration.
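The MRU counting idea can be illustrated with a short sketch. It models one set as a recency-ordered list and is a simplification, not the hardware from [7]: the names `stack`, `counters`, and `hits_in_a` are hypothetical, and the A/B block swapping is abstracted into the recency ordering.

```python
def access(stack, counters, tag):
    """Access one set. stack[0] holds the MRU 0 block and stack[-1] the LRU
    block; counters[i] counts hits that found the block in MRU state i."""
    if tag in stack:
        pos = stack.index(tag)
        counters[pos] += 1
        stack.pop(pos)
        stack.insert(0, tag)   # the accessed block becomes MRU 0
        return True
    stack.pop()                # miss: evict the LRU block
    stack.insert(0, tag)
    return False


def hits_in_a(counters, a_ways):
    """Hits that an A partition of a_ways ways would have captured."""
    return sum(counters[:a_ways])
```

After a run, `hits_in_a(counters, 2)` gives the hits a 2-way A partition would have seen, and `sum(counters) - hits_in_a(counters, 2)` gives the hits that would have fallen in the B partition, without simulating each configuration separately.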
This design is successful, resulting in about a 30% energy savings in L1 and a 60% energy
savings in L2. This implementation, however, does not reduce static power.
2.1.4 Phase Adaptive Cache
The accounting cache was then refined into a Phase-Adaptive Cache [8]. This cache
dynamically adjusts its size and speed based on previous use, to increase performance. This
design allows different domains within a processor to run at different operating frequencies,
resulting in a globally asynchronous, locally synchronous (GALS) design. Figure 2.5 shows the
architecture for this implementation.
Figure 2.5: Different Clock Domains within Phase Adaptive Cache Design [8]
Each section of Figure 2.5 is operating with a different clock period. This allows each of
these pieces to run at a different operating frequency. The latency for the A partition can
therefore remain the same while the frequency is modified. This means that when the A
partition is at its smallest the frequency of the system is at its greatest, which results in a
large variation in the latency to access the B partition. When the size of the A partition is
small, the latency to B can be three or more times as high.
The phase-adaptive cache is compared against a program-adaptive cache as well as a static
cache design. The program-adaptive cache changes from benchmark to benchmark but
does not dynamically change during benchmark execution. Of the three designs, the static
cache design performs the worst. The phase-adaptive cache outperforms the program-
adaptive cache, in most cases. The program-adaptive cache does perform better than the
phase-adaptive design in a few cases. This is due to the specific nature of the benchmarks
and the higher branch prediction accuracy present in the program-adaptive implementation.
2.1.5 Drowsy Cache
There is one method that has been explored for reducing static power. All of the previous
works explored focus on reducing dynamic power, however, as technology sizes shrink
below 0.1 µm static power will begin to dominate the total power dissipation [14]. The
Drowsy Cache design [11] is able to significantly reduce leakage power, with a minimal
performance penalty. This involves putting a SRAM cell in a drowsy state to try and reduce
subthreshold leakage. The drowsy memory circuit is shown in Figure 2.6.
Figure 2.6: Drowsy memory circuit diagram [11]
The drowsy circuit shows an SRAM cell that either receives the normal power supply
voltage or it receives a lower voltage. Dynamic voltage scaling (DVS) can be used to
lower the supplied voltage to the SRAM cell while preserving its state as shown in [14].
This is very important because preserving the state allows for less of a penalty and less of
a performance hit than using other gated methods.
There are two cross-coupled inverters in the SRAM cell as shown in Figure 2.6. This
results in two different leakage current paths. The dominant current can be modeled as
shown below:
\[ I_D = I_s \, e^{\frac{V_{GS} - V_T}{nkT/q}} \left(1 - e^{-\frac{V_{DS}}{kT/q}}\right) \left(1 + \lambda V_{DS}\right) \tag{2.1} \]
where λ is the channel-length modulation parameter. The overall leakage current is then
derived from Equation 2.1 and is shown below:
\[ I_L = \left( (I_{SN} + I_{SP}) + (I_{SN}\lambda_N + I_{SP}\lambda_P) V_{DD} \right) \times \left(1 - e^{-\frac{V_{DD}}{nkT/q}}\right) \tag{2.2} \]
where ISN and ISP represent the nMOS and pMOS off-transistor current factors independent
from Equation 2.1. From Equation 2.2 it can be seen that IL depends largely on VDD.
Therefore a slight reduction in VDD will have a large impact on the leakage current within
the SRAM cell. From there Kim et al. choose a voltage that is 50% higher than their
threshold voltage. This is to guarantee that the state of the data is preserved and is a
conservative approximation. With this, drowsy voltage leakage is reduced by about 80%.
To implement this drowsy cache design additional logic is needed around the SRAM cell.
Figure 2.7 shows the logic required for a single drowsy cache line.
The Drowsy state allows for a lower performance penalty than if the design had a gated
VDD. If the tag array design is normal then there is just a 1 cycle penalty to raise the voltage
on the cache line to read the data. If the tag array also uses drowsy SRAM cells then there
is a 3 cycle penalty to raise the voltage in the line. This cache design allows for a 50%
reduction in energy used by the cache design with a very small decrease in performance.
2.1.6 Temporal Locality for Drowsy Caches
There has also been important research done about temporal locality in caches specifically
related to this idea of a drowsy cache. Petit et al. [16] looked at temporal locality to
Figure 2.7: Drowsy Cache Line Implementation Logic [11]
determine how to best use a drowsy cache design. They found that the majority of the
accesses to the cache occur to the blocks that are most recently used, specifically the most
recently used states 0 and 1. When looking at the first MRU line they found 92% of the
total hits occurred there. When extending this to the second MRU line they found 98%
of the total hits occurred there. This information can therefore be used to safely determine
which areas of the cache to put into a drowsy state to incur the smallest performance penalty
possible.
2.2 Related Work
There are others who have tried to do similar work using these existing results. Most of
the existing implementations are designed to either save static or dynamic power. The
implementation proposed here strives to save both dynamic and leakage power. Previous
implementations designed to save leakage power are not state preserving and therefore
incur a much higher performance hit than the design presented here.
2.2.1 MorphCache
Srikantaiah et al. propose a reconfigurable adaptive multi-level cache design called a
MorphCache [18]. This cache dynamically changes the cache hierarchy to improve per-
formance. The initial configuration is per-core L2 and L3 cache slices. From there the
MorphCache either merges slices or splits them, dynamically, to change the slices and
configurations available to each core.
Figure 2.8: MorphCache Example Four Core Topology [18]
Figure 2.8 shows a MorphCache for a system with four cores where each core has a private
L1 and close access to L2 and L3 slices. Figure 2.8 also shows that the MorphCache is
capable of adapting into both a symmetric and asymmetric topology.
The MorphCache will merge slices when either one slice is highly-utilized while the other
is under-utilized, or both are highly-utilized and shared by threads sharing the same address
space. Merging must be done carefully though, because merging two L2 slices when L3
slices are split, could result in the L2 capacity being larger than L3. To avoid this the
MorphCache only merges L2 slices if the split L3 slices can also be merged as well. When
splitting slices it is necessary to ensure that splitting L3 does not result in a smaller L3
than L2 similar to the merging requirements. Therefore, the MorphCache only splits L3 if
the corresponding merged L2 slices can be split as well.
This implementation is focused on improving performance. The work proposed here differs
from this because it is focused on saving power with a small performance hit. Also the
MorphCache requires modifications to the interconnection network whereas the proposed
design does not.
2.2.2 Efficient Cache Resizing
Keramidas et al. propose a new framework for efficient cache resizing in terms of both
power and performance [13]. The cache proposed here dynamically reconfigures memory
based on the behavior of the running application. Figure 2.9 shows the overall architecture
for this cache design.
Figure 2.9: Resizable Cache Architecture [13]
The two mask registers, set-mask and way-mask, control both the horizontal and vertical
resizing. Each bit in the way-mask register controls enabling and disabling that corre-
sponding cache way. For horizontal resizing additional cache access logic is needed. The
number of sets, in a conventional cache, defines both the tag bits and index used to look up
and place a cache block. The set-mask register is used to ensure that the correct number of
index bits is being used for a particular cache size. When downsizing, the number of index
bits decreases and vice versa. The number of tag bits increases as cache size decreases,
therefore there must be as many tag bits as needed for the smallest cache size possible for
this design.
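The role of the two mask registers can be illustrated with a short sketch. This is an illustration under stated assumptions and not the logic from [13]: the function names are hypothetical, the set-mask is assumed to be a contiguous low-order bit mask, and the tag is assumed to always be stored at the width required by the smallest cache size.

```python
def locate(addr, offset_bits, set_mask, min_index_bits):
    """Return (index, tag) for an address under the current set-mask.

    min_index_bits is the number of index bits of the smallest supported
    cache, so the stored tag is wide enough for every size.
    """
    block = addr >> offset_bits     # strip the block offset
    index = block & set_mask        # fewer mask bits means fewer sets
    tag = block >> min_index_bits   # worst-case (widest) tag
    return index, tag


def active_ways(way_mask, num_ways):
    """Return the indices of the ways enabled by the way-mask."""
    return [w for w in range(num_ways) if way_mask & (1 << w)]
```

Shrinking the set-mask from `0xFF` to `0x3F` changes the index used for the same address while the stored tag stays the same, which is why the tag must be sized for the smallest configuration.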
In a typical resizing cache, when the cache size is decreased the discarded part of the
cache is immediately deactivated. Here, instead, the discarded part is kept active and is
deactivated gradually. When all of the discarded parts are deactivated, the transition is
complete. This protects against the miss rate hiccups that generally occur immediately after
resizing and that can severely hurt performance. Overall this method results in cache size
reduction of anywhere from 13% to 30% with a very low performance impact of 4% to
10%.
This work focuses on reducing the area of a cache in a way that is efficient in terms of both
energy and performance. The implementation proposed here differs from this because it
focuses on saving leakage and dynamic energy, by modifying the cache configuration, while
keeping the cache area the same.
2.2.3 DRG-Cache
Agarwal et al. propose a Data Retention Gated-Ground Cache, or a DRG-Cache. This
cache uses gated-ground techniques to reduce leakage power consumption in cache mem-
ory systems [1]. The gated-ground technique in [17] inserts an NMOS transistor between
the ground line and the SRAM cell. This allows for the supply voltage to be effectively
off, substantially reducing leakage energy dissipation. In the DRG-Cache this transistor is
connected to the row decoder which then signals which cells are in standby mode, where
the supply is effectively off, and which are in active mode. This design is shown in Figure
2.10.
Figure 2.10: DRG-Cache Gated-Ground Transistor [1]
This technique allows for a large reduction in leakage energy, however it increases dynamic
energy for read and write operations. Also, turning off the supply voltage results in the pos-
sibility of destroying data stored in the SRAM cell. Another component to this technique
is determining the optimal size of this Gated-Ground transistor. Increasing the size of this
transistor improves performance as well as data retention but lowers the overall power sav-
ings. Since data retention is very important in this application the total energy savings is
limited because the size of the transistor will remain large to retain data. The work pro-
posed here differs from this because it is state-preserving. Also the proposed work aims to
save both dynamic and leakage power whereas this work only saves leakage power while
increasing dynamic energy.
2.2.4 Location Cache
Jason Nemeth et al. propose a location cache that has a low power second level cache,
using gated-ground techniques [15]. Combining these two ideas, the goal is to save on
both static and dynamic power. This second level is using the gated-ground techniques
discussed in [17] similarly to the DRG-Cache in [1]. This implementation combines this
low power structure with a location cache. This location cache is a direct-mapped cache
that provides the second level cache with accurate access way location information. The
additional cache runs in parallel with the L1 cache. This reduces L2 power consumption
more than the typical set-associative cache. This memory hierarchy is shown in Figure
2.11.
Figure 2.11: Location Cache Example Hierarchy [15]
The access pattern for this cache first determines if there is a hit in L1. If there is, the result
obtained from the location cache, running in parallel, is discarded. If there is a miss in
L1 and a hit in the location cache, L2 is accessed as a direct-mapped cache, reducing the
access power. If there is a miss in both L1 and the location cache, then L2 is accessed as
a conventional set-associative cache. In this case the data in the location cache is updated.
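The access pattern above reduces to a small decision table, sketched below. The function name and the use of "number of L2 ways probed" as a stand-in for access power are assumptions made for this example and are not taken from [15].

```python
def l2_ways_probed(l1_hit, location_hit, l2_associativity):
    """Return how many L2 ways are read for one access."""
    if l1_hit:
        return 0                 # the location cache result is discarded
    if location_hit:
        return 1                 # L2 is accessed as a direct-mapped cache
    return l2_associativity      # conventional set-associative access
```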
This idea is further extended to work with multicore architectures. There are two ways to
implement this architecture. The cores share L2 and also share the same location cache or
all cores share only one location cache. This implementation is shown to save between 2%
and 43% of total power, however it has the same limitations as the DRG-Cache [1] because
it uses the same gated-ground technique.
2.2.5 Application Specific Low Leakage Cache
Farahani et al. propose a modification to the general drowsy cache configuration [9]. This
work proposes two different methods for saving leakage in a cache. In the first design,
cache words are placed in a drowsy mode at the end of a period of time called an update
window (UW). When a word is needed, only that word is brought back up to an active
voltage level and the rest of the cache line remains in a drowsy state. In this design, they
are able to save on the penalty of activating the entire line while still saving energy by
keeping the other words in the line in a drowsy state. This could, however, perform worse
than the standard drowsy cache if more than one word in the cache line is needed. In this
case, the penalty will be higher for waking up more than one word, than if the entire cache
line had been activated.
The second approach identifies active words and does not place them into the drowsy mode
at the end of a UW. This allows for a performance savings as well as leakage savings, with
the words that are placed in the drowsy state. A single status-bit is used here to record
whether or not a word is active. This bit is set high when a word is read and at the end of
a UW all words with a status-bit of zero are placed in the drowsy state. At the end of the
next UW the status-bit is set to zero.
The first implementation reduces leakage power by an average of 88% with an average
performance loss of just 0.7%. The second implementation reduces leakage power by an
average of 89% with an average performance loss of 0.5%. Although this implementation
reduces leakage power significantly with a low performance penalty it does not reduce
dynamic power at all.
2.2.6 Multi-Core Analysis
Jorge Albericio [2] has recently proposed a reuse cache. This cache uses locality to only
store data that is prone to being reused. Jorge Albericio’s work focuses on multicore archi-
tectures in which the last level of cache is shared among multiple cores. He proposes using
reuse locality, rather than temporal locality, along with modified least recently used (LRU)
and not recently used (NRU) replacement policies. These modified replacement policies,
least recently reused (LRR) and not recently reused (NRR), perform better than their coun-
terparts (LRU and NRU) in a multicore system with a high workload. This work explores
locality in a shared cache multicore environment, however it does not propose an energy
savings. The architecture used here will be used to also test the drowsy phase adaptive
cache in a multicore environment.
2.3 Summary
This chapter discussed supporting work that is related to the work presented here. It also
presented previous work that has attempted to achieve similar goals. These previous works
aim to save power in the memory hierarchy while incurring a small performance penalty.
The next chapter provides an in depth explanation of the design proposed in this work. It
also shows how the supporting work discussed in this chapter is used in this final design.
Chapter 3
Drowsy Phase Adaptive Cache for L1, L2, and
L3 On-Chip Cache
The drowsy phase adaptive cache presented in this thesis is a combination of the phase
adaptive cache and the drowsy cache. The phase adaptive cache attempts to save dynamic
power while incurring a modest performance penalty. The drowsy cache attempts to save
leakage energy while again incurring a small performance hit. The drowsy phase adaptive
cache combines these designs to save both dynamic power and leakage power in the cache
while incurring a small performance penalty. This work extends upon the drowsy phase
adaptive cache proposed in [10] to multithreaded and multicore applications as well as all
levels of cache.
3.1 Drowsy Phase Adaptive Cache
The accounting cache proposed in [7] is reduced for this work, limiting the possible
configurations to powers of two. The number of possible configurations will vary based on the
associativity of each cache level. For a four way set-associative cache, the configurations
are 1/3, 2/2, and 4/0, where each number represents A/B partitions as shown in Table 3.1.
These configurations will be referenced as C0, C1, and C2.
Name    A partition    B partition
C0      1-way          3-way
C1      2-way          2-way
C2      4-way          0-way
Table 3.1: Possible configurations of 4 way cache hierarchy
For an 8 way set-associative cache the configurations are 1/7, 2/6, 4/4, 8/0, which will be
referenced as C0, C1, C2, and C3 as shown in Table 3.2.
Name    A partition    B partition
C0      1-way          7-way
C1      2-way          6-way
C2      4-way          4-way
C3      8-way          0-way
Table 3.2: Possible configurations of 8 way cache hierarchy
Finally, for a 16 way set-associative cache the configurations are 1/15, 2/14, 4/12, 8/8, 16/0
which will be referenced as C0, C1, C2, C3, C4 as shown in Table 3.3.
Name    A partition    B partition
C0      1-way          15-way
C1      2-way          14-way
C2      4-way          12-way
C3      8-way          8-way
C4      16-way         0-way
Table 3.3: Possible configurations of 16 way cache hierarchy
This allows for different possible configurations that resize the A partition to powers of
two. These three associativities will be the total associativity of the different cache levels.
Cache level 1 will be a 4 way set-associative cache, level 2 will be an 8 way
set-associative cache, and level 3 will have 16 ways. The total number of ways in the cache
never changes, however, the different configurations allow for each partition to have a
varying number of ways. Figure 3.1 shows an example for the different number of ways
each partition can have. This example is for the second cache level. In this figure the yellow
area represents partition A and the green area represents partition B. Notice that in the C0
configuration partition A is actually acting as a direct mapped cache with only one way,
however partition B is fully phase adaptive with a total of 7 ways.
Figure 3.1: Example of cache partitioning [10]
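The configuration lists in Tables 3.1 through 3.3 follow directly from restricting the A partition to powers of two. A minimal sketch, with a hypothetical function name, that reproduces them:

```python
def configurations(associativity):
    """Return (A ways, B ways) pairs; list position i corresponds to Ci."""
    configs = []
    a_ways = 1
    while a_ways <= associativity:
        configs.append((a_ways, associativity - a_ways))
        a_ways *= 2   # the A partition is restricted to powers of two
    return configs
```

Calling `configurations(4)`, `configurations(8)`, and `configurations(16)` yields three, four, and five configurations respectively, matching the L1, L2, and L3 tables.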
The access protocol for this cache design takes advantage of the dual partition structure.
The cache first looks for the necessary data in the A partition. MRU counters are used to
control when the cache changes between different configurations. If the data is found in
the A partition, a corresponding MRU counter is updated and the data is returned. If the
data is not found in A, the B partition is searched. If it is found in the B partition, the MRU
counter is updated and this block is then swapped into the A partition. If the data is not
found in the B partition and is found in the next level, then the block is brought up to the A
partition and a block from the A partition is evicted into the B partition. A block from the
B partition is then also evicted into the next level. After this the MRU states are updated to
reflect the changing states. Figure 3.2 shows an example access pattern with updates to the
MRU counters and states for a 4-way set-associative cache.
The blocks of color show the current MRU state. MRU state 3 is equivalent to the least
recently used state (LRU). First, block B is accessed and its MRU counter is increased.
Since this data is located in the A partition, the green area, this is all that is necessary.
From there, block C is accessed and the MRU state counter is incremented. In this case,
that is the counter for MRU state 2 represented by the purple block. However block C is in
the B partition, therefore C needs to be swapped into A. From there C is accessed again and
the MRU state counter is again increased. However this time it is the counter for MRU 0,
Figure 3.2: Example access pattern with MRU counters and states [10]
represented by the yellow block, because it was swapped with A. C was swapped because
it was the most recently used block, which always needs to be kept in the A partition. A
was chosen because A was the least recently used block in the A partition. Finally D is
accessed and the counter for MRU state 3 is increased [10].
The phase adaptive portion of this design is the same as in [10]. A set number of instruc-
tions represents a phase. After each phase, statistics are collected and a new configuration
is determined. In this design, a phase is 15,000 instructions. This provides small enough
granularity without reducing performance [10]. To ensure that the design is not constantly
changing its phase a warm up period of one phase is given in between each configuration
change. This means that before each configuration change there is at least 15,000 instruc-
tions worth of data used to determine the next optimal configuration [10].
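The phase and warm-up behavior can be sketched as a small controller. This is one possible reading of the description and not the code from [10]: the class and method names are hypothetical, `choose_config` stands in for the cost-function evaluation, and the sketch assumes each call retires fewer instructions than one phase.

```python
PHASE_LENGTH = 15_000  # instructions per phase, as stated in the text


class PhaseController:
    def __init__(self, config=0):
        self.config = config
        self.instructions = 0
        self.warming_up = False

    def retire(self, count, choose_config):
        """Account for count retired instructions.

        Returns True when a phase boundary was crossed.
        """
        self.instructions += count
        if self.instructions < PHASE_LENGTH:
            return False
        self.instructions -= PHASE_LENGTH
        if self.warming_up:
            # Assumed behavior: the phase right after a change is a warm-up
            # phase, so no reconfiguration happens at its end.
            self.warming_up = False
            return True
        new_config = choose_config()
        if new_config != self.config:
            self.config = new_config
            self.warming_up = True
        return True
```

Under this reading, two configuration changes are always separated by at least one full phase of statistics gathered in the new configuration.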
The B partition is placed in a drowsy state. This is done by setting the voltage on the cache
lines in the B partition to 0.7V instead of 0.9V. This voltage value was determined to be the
optimal by Brendan Fitzgerald et al. in [10]. Whenever the B partition is accessed
it takes one cycle to raise the voltage back to 0.9V and read the data, meaning that each
time the B partition is accessed performance is reduced. Since 92% of cache accesses are
to MRU 0 and 98% of cache accesses are to MRU 0 or MRU 1 this does not incur a large
performance hit [16]. The majority of accesses are to partition A, therefore the majority of
the time, there will be no performance hit from accessing the drowsy cache. This allows for
a large amount of leakage savings when the B partition is large, with a small performance
hit.
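A rough estimate of the wake-up overhead follows from the locality figures quoted above. The sketch below is a back-of-the-envelope calculation only: it assumes the 92% and 98% figures from [16] carry over to this cache, and the names are hypothetical.

```python
# Cumulative share of hits captured by the first MRU states, from [16].
CUMULATIVE_HIT_SHARE = {1: 0.92, 2: 0.98}


def drowsy_hit_fraction(a_ways):
    """Fraction of hits that land in the drowsy B partition."""
    return 1.0 - CUMULATIVE_HIT_SHARE[a_ways]


def expected_wakeup_cycles_per_hit(a_ways, wakeup_cycles=1):
    """Average extra wake-up cycles per hit for a given A partition size."""
    return drowsy_hit_fraction(a_ways) * wakeup_cycles
```

With a one-way A partition about 8% of hits would pay the one-cycle wake-up, and with a two-way A partition about 2% would, which is the basis for expecting a small performance hit.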
This design takes advantage of the energy saving techniques used by the accounting cache
and phase adaptive cache to save dynamic power. It also saves leakage power by placing
the B partition in a drowsy state. This design, however, does affect performance. One addi-
tional cycle is needed to access data in the B partition and swap blocks between partitions.
There is also an additional one cycle latency to bring the B partition voltage back up to an
active state, to access data in the B partition. However, this design takes advantage of cache
locality and access patterns. When a majority of the data in the cache is accessed multiple
times, the energy savings is high and the performance penalties are small, making this an
ideal cache design for many applications.
3.1.1 Cost Functions
The ideal cache configuration in this phase adaptive cache is determined using different cost
functions. The generic cost function shown below was derived previously by Fitzgerald
[10]. This function is based off of the work presented by Dropsho et al. [7]. The cost is
based either on delay or energy. The general cost function is shown in Equation 3.1.
\[ \mathit{Cost} = \mathit{hits}_A \cdot \mathit{cost}_A + \mathit{hits}_B \cdot \mathit{cost}_B + \mathit{misses} \cdot \mathit{cost}_{misses} \tag{3.1} \]
In this equation hitsA and hitsB are the number of hits in the A and B partitions. These are
calculated by summing the MRU counters for the ways that correspond to each partition.
The variable misses, is the number of misses in the cache level and costmisses is the cost
of accessing the next cache level. The cost variables represent the cost to either power,
latency, or both, of accessing the partition or incurring a miss in the partition. Since this
work is primarily looking at energy Equation 3.2 represents the energy cost.
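\[
\begin{aligned}
\mathit{EnergyCost}_i ={}& \mathit{hits}_A \cdot \mathit{DynamicEnergy}_A + \mathit{hits}_B \cdot \mathit{DynamicEnergy}_B \\
&+ \mathit{misses} \cdot \mathit{cost}_{misses} + \mathit{LeakageEnergy}_A + \mathit{LeakageEnergy}_B \\
&+ \mathit{swaps} \cdot (\mathit{DynamicEnergy}_A + \mathit{DynamicEnergy}_B)
\end{aligned}
\tag{3.2}
\]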
The delay cost function defined in [10] is shown below in Equation 3.3.
\[ \mathit{DelayCost} = \mathit{hits}_A \cdot \mathit{Latency}_A + \mathit{hits}_B \cdot \mathit{Latency}_B + \mathit{misses} \cdot \mathit{Latency}_{misses} \tag{3.3} \]
The total cost can be described as either the total energy cost or the total delay cost. These
equations have been derived for the L1, L2, and L3 cache configurations. They are appli-
cable for multithreaded and multicore simulations being done in this work.
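Equations 3.2 and 3.3 translate directly into code, and the MRU counters supply the hit counts for each candidate configuration. The sketch below is illustrative only: the function and parameter names are hypothetical, the per-configuration energy values would come from CACTI, and approximating the number of swaps by the number of B partition hits is an assumption made for this example.

```python
def delay_cost(hits_a, hits_b, misses, latency_a, latency_b, latency_miss):
    """Equation 3.3."""
    return hits_a * latency_a + hits_b * latency_b + misses * latency_miss


def energy_cost(hits_a, hits_b, misses, swaps,
                dyn_a, dyn_b, cost_miss, leak_a, leak_b):
    """Equation 3.2."""
    return (hits_a * dyn_a + hits_b * dyn_b + misses * cost_miss
            + leak_a + leak_b + swaps * (dyn_a + dyn_b))


def best_configuration(mru_counters, misses, params):
    """Return the A partition size with the lowest energy cost.

    params maps an A partition size (in ways) to a dict with the keys
    dyn_a, dyn_b, cost_miss, leak_a, and leak_b for that configuration.
    """
    best = None
    for a_ways, p in params.items():
        hits_a = sum(mru_counters[:a_ways])
        hits_b = sum(mru_counters[a_ways:])
        swaps = hits_b  # assumption: every B hit triggers one swap
        cost = energy_cost(hits_a, hits_b, misses, swaps, **p)
        if best is None or cost < best[1]:
            best = (a_ways, cost)
    return best[0]
```

A single set of counters collected during one phase is enough to score every configuration, so no configuration has to be tried in order to be evaluated.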
3.2 Summary
This chapter discusses the overall design of the Drowsy Phase Adaptive Cache. Much of
this design reflects the previous work done by Brendan Fitzgerald in [10]. Overall this
design combines two cache designs, the phase adaptive cache, and the drowsy cache, to
produce a cache design that saves both static and dynamic power. The next chapter will
discuss the methodology used to implement this design for all three levels of the cache,
determine values used for simulations, as well as setting up the simulation environment.
Chapter 4
Methodology
Performing the proposed work required multiple tools as well as results from previous work. SPICE
simulations done previously were used to determine the drowsy voltage used in this work.
CACTI was used to perform hardware simulations for 32nm technology to determine the
appropriate latency and power values for each configuration’s cost function described in
Chapter 3. Finally the design was simulated using Multi2sim, which was modified to im-
plement this design.
4.1 SPICE
In the previous work described in [10], Brendan Fitzgerald et al. used SPICE to determine
the optimal drowsy voltage level of 0.7V. Figure 4.1 shows the circuit that was used for the
SPICE simulations. This circuit is designed to be state preserving. This ensures that no
data is lost when the cells are placed in a Drowsy state.
Figure 4.2 shows the SPICE simulation results. In this figure VDD begins set to 0.9V for
7ns. From there it falls to 0.7V at 7.33ns at a gradual slope. After reaching 0.7V, the voltage
stays there until 19ns, before it is raised to 0.9V. This voltage rise occurs over a period of
0.33ns. This plot ensures that the data can be kept at 0.7V and also can be read back out
after returning to 0.9V. These results, shown in [10], show that 0.7V is the appropriate
drowsy voltage for data preservation and that is the drowsy voltage used in this work.
Figure 4.1: SRAM Cell used for SPICE simulations [10]
Figure 4.2: SPICE Simulation Results [10]
4.2 CACTI
The hardware parameters for this cache design were determined using CACTI 6.5, a power
and performance model for cache architectures [10]. It is used to model access time, area,
cycle time, and energy consumption. This makes CACTI well suited for determining both the
latency and power parameters for this novel cache architecture.
This design is focused on all three cache levels; therefore, the hardware parameters for
each level need to be determined. For each cache level, Uniform Cache Access (UCA) was
used, ensuring that access to each block of the cache would be the same. Also, each cache
has two exclusive read ports and two exclusive write ports. Sequential access mode is
used, meaning that the tag array is accessed first and then the data array. This reduces
energy consumption in comparison to other access modes.
The CACTI source code is modified for the drowsy simulations, which require that the
voltage used in the CACTI model is set to 0.7 volts instead of the normal 0.9 volts. To do
this, a field in the CACTI source code is changed to 0.7 volts, CACTI is recompiled, and
the recompiled version is run for all drowsy models. This is the only difference between
the regular evaluations and the drowsy evaluations.
4.2.1 Latency
The latency for access to each partition needs to be determined. CACTI only allows
associativities that are powers of two; therefore, linear interpolation is needed to
determine the latency values for configurations where the B partition is not a power of
two. All of these simulations assume a 3 GHz processor and are done for both the drowsy
case, where the voltage is reduced, and the normal case. Table 4.1 shows the latency
values for L1.
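The relationship between the access time and cycle columns of Table 4.1 can be
illustrated with a minimal sketch, assuming the listed access times are in seconds and the
3 GHz clock stated above. The function and variable names are illustrative only and are
not taken from CACTI or Multi2sim. Because the table's cycle counts appear to come from
unrounded access times, the products below agree with the table only to about two decimal
places.

```python
CLOCK_HZ = 3e9  # 3 GHz processor assumed by the CACTI simulations


def access_time_to_cycles(access_time_s: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert an access time in seconds to a (fractional) number of cycles."""
    return access_time_s * clock_hz


# Phase L1 access times from Table 4.1(a), keyed by associativity.
phase_l1_access_time_s = {4: 1.21e-9, 2: 8.90e-10, 1: 7.00e-10}

phase_l1_cycles = {
    ways: access_time_to_cycles(t) for ways, t in phase_l1_access_time_s.items()
}

for ways, cycles in sorted(phase_l1_cycles.items()):
    print(f"{ways}-way: {cycles:.2f} cycles")
```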
Associativity   Access Time (s)   Cycles
4               1.21E-09          3.63522
2               8.90E-10          2.66961
1               7.00E-10          2.100315
(a) Phase L1 Latency Results
Associativity   Access Time (s)   Cycles
4               8.67E-10          2.6008599
2               7.01E-10          2.101683
1               5.41E-10          1.621656
(b) Drowsy L1 Latency Results
Table 4.1: CACTI latency results for L1
From there, this data is plotted and linear interpolation is used to determine an equation
for the latency in relation to the associativity. Figure 4.3 shows this equation and plot.
To ensure that the best equation was found to represent the data, the equation was chosen
so that its coefficient of determination (R^2) was as close to one as possible. This
equation represents the latency for L1. Solving this equation for y, where x is the
associativity of each partition at a given configuration, results in the latency values
shown in Table 4.2.
(a) Phase
(b) Drowsy
Figure 4.3: Plot and equation for L1 Latency
Name   A partition       B partition
C0     1-way, 2-cycle    3-way, 3-cycle
C1     2-way, 3-cycle    2-way, 3-cycle
C2     4-way, 4-cycle    0-way, 0-cycle
Table 4.2: Possible configurations and latency of L1
Next, the same process is followed for L2. Table 4.3 shows the results of the CACTI
simulations for L2. Figure 4.4 shows the equation and plot for L2 Latency.
Associativity   Access Time (s)   Cycles
8               1.87E-09          5.60835
4               1.45E-09          4.34616
2               1.24E-09          3.73308
1               9.17E-10          2.750178
(a) Phase L2 Latency Results
Associativity   Access Time (s)   Cycles
8               1.54E-09          4.62834
4               1.10E-09          3.30096
2               9.19E-10          2.756871
1               7.27E-10          2.1807
(b) Drowsy L2 Latency Results
Table 4.3: CACTI latency results for L2
Solving this equation for y where x is the associativity of each partition at any configuration,
results in the latency values shown in Table 4.4.
Name   A partition       B partition
C0     1-way, 3-cycle    7-way, 5-cycle
C1     2-way, 4-cycle    6-way, 4-cycle
C2     4-way, 5-cycle    4-way, 4-cycle
C3     8-way, 5-cycle    0-way, 0-cycle
Table 4.4: Possible configurations and Latency for L2
(a) Phase
(b) Drowsy
Figure 4.4: Plot and equation for L2 Latency
Finally, the same process is followed for L3. Table 4.5 shows the CACTI results for L3.
Figure 4.5 shows the equation and plot for L3 Latency.
Associativity   Access Time (s)   Cycles
16              5.76E-09          17.27931
8               4.52E-09          13.55289
4               3.32E-09          9.97071
2               3.89E-09          11.66925
1               2.17E-09          6.49836
(a) Phase L3 Latency Results
Associativity   Access Time (s)   Cycles
16              4.71E-09          14.14128
8               3.62E-09          10.86384
4               3.06E-09          9.17508
2               2.57E-09          7.72329
1               1.82E-09          5.4594
(b) Drowsy L3 Latency Results
Table 4.5: CACTI latency results for L3
(a) Phase
(b) Drowsy
Figure 4.5: Plot and equation for L3 Latency
Solving this equation for y where x is the associativity of each partition at any configuration
results in the latency values shown in Table 4.6.
Name   A partition        B partition
C0     1-way, 7-cycle     15-way, 14-cycle
C1     2-way, 9-cycle     14-way, 14-cycle
C2     4-way, 12-cycle    12-way, 13-cycle
C3     8-way, 14-cycle    8-way, 12-cycle
C4     16-way, 16-cycle   0-way, 0-cycle
Table 4.6: Possible configurations and Latency for L3
These latency values are used in the cost equations discussed previously to determine the
overall performance of the system.
4.2.2 Energy and Power
Energy and power values are also needed for each cache configuration. Just like with the
latency values, linear interpolation is used to find the energy and power values for cache
configurations that are not a power of two. Both leakage power and dynamic power are
found for each level of the cache, for each of the possible configurations that are powers of
two. This is done for both the drowsy case, where the voltage is reduced, and the regular
case.
Table 4.7 shows the energy and power results for the L1 cache level. From there, linear
interpolation is used to find an equation that can be used to solve for the energy values
for all of the configurations. Figure 4.6 shows the results of the linear interpolation
for these values.
Associativity Dynamic (nJ) Leakage (mW)
1 0.0119956 4.03832
2 0.0211765 7.98129
4 0.0395421 15.8668
(a) Phase L1 Energy Results
Associativity Dynamic (nJ) Leakage (mW)
1 0.00752034 3.1471
2 0.0134967 6.22004
4 0.0503375 19.8188
(b) Drowsy L1 Energy Results
Table 4.7: CACTI L1 Energy Results
Using these equations to solve for the leakage and dynamic energy for each number of ways
produces the results shown in Table 4.8.
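As an illustration of this step, the sketch below fits a straight line to the Phase L1
values from Table 4.7(a) by ordinary least squares and evaluates it at the 3-way point
that CACTI cannot model directly. The straight-line form is an assumption made here; the
fitted equations themselves appear only in Figure 4.6, and the drowsy curves may use a
different form. The helper name fit_line is hypothetical. Under this assumption the 3-way
estimates come out close to the Phase entries listed in Table 4.8(a).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Phase L1 values from Table 4.7(a): associativities CACTI can model directly.
ways = [1, 2, 4]
phase_dynamic_nj = [0.0119956, 0.0211765, 0.0395421]
phase_leakage_mw = [4.03832, 7.98129, 15.8668]

dyn_slope, dyn_intercept, dyn_r2 = fit_line(ways, phase_dynamic_nj)
leak_slope, leak_intercept, leak_r2 = fit_line(ways, phase_leakage_mw)

# Estimate the 3-way partition, which is not a power of two.
dynamic_3way = dyn_slope * 3 + dyn_intercept
leakage_3way = leak_slope * 3 + leak_intercept

print(f"3-way dynamic: {dynamic_3way:.4f} nJ (R^2 = {dyn_r2:.5f})")
print(f"3-way leakage: {leakage_3way:.3f} mW (R^2 = {leak_r2:.5f})")
```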
(a) Phase Dynamic Energy
(b) Phase Leakage Energy
(c) Drowsy Dynamic Energy
(d) Drowsy Leakage Energy
Figure 4.6: L1 Energy Plot and Equations
Associativity   Dynamic (nJ)   Leakage (mW)
1               0.012          4.0384
2               0.0212         7.9812
3               0.0304         11.924
4               0.0396         15.8668
(a) Phase L1 Energy Results
Associativity   Dynamic (nJ)   Leakage (mW)
1               0.0074         3.146
2               0.0132         6.2155
3               0.0272         11.767
(b) Drowsy L1 Energy Results
Table 4.8: CACTI L1 Final Energy Results
The energy and power values used for L2 were determined previously in [10] by Brendan
Fitzgerald et al. Figure 4.7 shows the energy results and the equation used. Using the
equation shown, the energy numbers shown in Table 4.9 are found.
Figure 4.7: L2 Energy Plot and Equations
Table 4.10 shows the energy and power results for the L3 cache level. From there, linear
interpolation is used to find an equation that can be used to solve for the energy values
for all of the needed partitions. Figure 4.8 shows the results of the linear interpolation
for these values. Using these equations to then solve for the leakage and dynamic energy
for each number of ways produces the results shown in Table 4.11.
Associativity Dynamic (nJ) Leakage (mW)
1 0.0487877 17.85836
2 0.141499 56.9864
4 0.209668 133.9328
6 0.2735 260.6771385
7 0.3064 327.1112511
8 0.338417 403.5232
(a) Phase L2 Energy Results
Associativity Dynamic (nJ) Leakage (mW)
4 0.0653267 74.92824
6 0.0851 146.0103765
7 0.0952 183.3090019
(b) Drowsy L2 Energy Results
Table 4.9: CACTI L2 Energy Results
Associativity Dynamic (nJ) Leakage (mW)
1 0.43505 84.9223
2 0.607759 159.289
4 0.985836 319.192
8 1.52515 650.408
16 2.22933 1168.66
(a) Phase L3 Energy Results
Associativity Dynamic (nJ) Leakage (mW)
1 0.270054 66.5918
2 0.382259 124.663
4 0.639253 248.755
8 0.932648 497.42
16 1.47787 944.256
(b) Drowsy L3 Energy Results
Table 4.10: CACTI L3 Energy Results
(a) Phase Dynamic Energy
(b) Phase Leakage Energy
(c) Drowsy Dynamic Energy
(d) Drowsy Leakage Energy
Figure 4.8: L3 Energy Plot and Equations
Associativity Dynamic (nJ) Leakage (mW)
1 0.435 76.5151
2 0.6208 163.2174
4 0.9642 330.7066
8 1.5382 642.0234
12 1.9618 921.7914
14 2.1172 1049.8446
15 2.1808 1110.9135
16 2.235 1170.0106
(a) Phase L3 Energy Results
Associativity Dynamic (nJ) Leakage (mW)
8 0.959 494.5554
12 1.2534 725.3514
14 1.3754 836.4618
15 1.4301 890.9451
(b) Drowsy L3 Energy Results
Table 4.11: CACTI L3 Final Energy Results
4.3 Multi2sim
Multi2sim is an application-only simulation framework for heterogeneous computing [20].
This simulator is highly configurable, allowing for multithreaded and multicore
simulations. It also has a highly configurable memory hierarchy allowing for simulations
with very customized cache configurations. This simulator has been used and modified to
implement the work described here.
Previous work has been done on this simulator as described in [10]. Brendan Fitzgerald
et al. added a new input parameter to turn phase adaptive behaviour on and off for each
configurable cache level. The simulator was also modified to keep track of static and
dynamic power consumption using the equations and results discussed previously. This was
implemented for L2 only. Variables were also added to keep track of the MRU counters,
accesses, misses, and swaps occurring throughout a simulation. MRU counters are used to
keep track of the hits to each partition, implementing the design discussed in [7]. When
there is a hit in a particular way, the corresponding MRU counter is incremented. Based on
the current configuration, this MRU counter represents hits in either the A or B partition.
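The way a set of MRU counters maps onto partition hits can be sketched as follows. This is
a minimal illustration, not the Multi2sim code: it assumes that the most recently used
positions 0 to a_ways - 1 reside in the A partition and the remaining positions in the B
partition, and the hit counts are hypothetical. The A-partition sizes are those of the L1
configurations in Table 4.2.

```python
def split_hits_by_partition(mru_hits, a_ways):
    """Split per-MRU-position hit counts into (A hits, B hits).

    Assumption: MRU positions 0 .. a_ways-1 belong to the A partition and the
    remaining positions belong to the B partition.
    """
    if not 0 <= a_ways <= len(mru_hits):
        raise ValueError("a_ways must be between 0 and the total associativity")
    hits_a = sum(mru_hits[:a_ways])
    hits_b = sum(mru_hits[a_ways:])
    return hits_a, hits_b


# Hypothetical hit counts for a 4-way L1, indexed by MRU position (MRU 0 first).
example_mru_hits = [9000, 600, 250, 150]

# A-partition sizes for the L1 configurations C0, C1, and C2 (Table 4.2).
l1_a_ways = {"C0": 1, "C1": 2, "C2": 4}

partition_hits = {
    name: split_hits_by_partition(example_mru_hits, a_ways)
    for name, a_ways in l1_a_ways.items()
}

for name, (hits_a, hits_b) in partition_hits.items():
    print(f"{name}: {hits_a} hits in A, {hits_b} hits in B")
```

The same counters can therefore be reinterpreted for every candidate configuration at the
end of a phase without re-running the accesses.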
This modified version of Multi2sim 3.2 was used for the work described here. The first
portion of the simulator that needed to be modified was the MRU counters. There was a
bug in the previous modifications that resulted in an incorrect number of hits being
recorded for each MRU state. Fixing it required ensuring that the MRU counters were
updated every time there was a hit in a particular cache level. From there, the phase
adaptive implementations for L1 and L3 were added. Care was taken to follow the coding
standards of both the original authors and the previous additions that were made.
These modifications required adding additional control algorithms to the portion of the
code that controls the cache structure, as described in [10]. This section of the code
determines the next cache configuration based on the statistics for the just-finished
phase. In this case, a phase occurs every 15k instructions, as explained in Chapter 3.
From there, it sets the latency for the B partition and records the energy usage of the
previous phase. For L1 and L3, variables were added to collect the statistical
information. The configuration costs are then determined using Equation 3.2 and
Equation 3.3. Equation 4.1 is also used to convert the leakage power values to leakage
energy.
\[
\mathrm{LeakageEnergy} = \frac{\mathrm{LeakagePower} \times \mathrm{CycleCount}}{3\,\mathrm{GHz}} \tag{4.1}
\]
After that step, the next configuration is determined by comparing the energy usage of all
of the possible configurations for the just-finished phase. Once that is complete, the
counters and variables are reset and the simulation continues.
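The leakage conversion of Equation 4.1 and the final comparison step can be sketched as
below. This is an illustration under stated assumptions rather than the simulator's code:
the function name, the cycle count, and the per-configuration energy totals are all
hypothetical, and the totals stand in for the results of Equations 3.2 and 3.3, which are
not reproduced here. Only the 15.8668 mW leakage value is taken from Table 4.8(a).

```python
CLOCK_HZ = 3e9  # 3 GHz clock, as in Equation 4.1


def leakage_energy_nj(leakage_power_mw, cycle_count, clock_hz=CLOCK_HZ):
    """Equation 4.1: leakage power times elapsed time, returned in nJ."""
    elapsed_s = cycle_count / clock_hz
    joules = leakage_power_mw * 1e-3 * elapsed_s  # mW -> W, then W * s = J
    return joules * 1e9  # J -> nJ


# Phase 4-way L1 leakage power from Table 4.8(a); the cycle count is hypothetical.
example_leakage_nj = leakage_energy_nj(15.8668, 15_000)
print(f"Leakage energy over 15,000 cycles: {example_leakage_nj:.3f} nJ")

# Hypothetical total energies (nJ) of each configuration for the finished phase.
phase_energy_nj = {"C0": 410.0, "C1": 395.5, "C2": 402.3}

# The configuration with the lowest energy is used for the next phase.
next_config = min(phase_energy_nj, key=phase_energy_nj.get)
print(f"Next configuration: {next_config}")
```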
4.3.1 Simulation Configurations
The memory configuration was chosen to match the simulations done by Jorge Albericio
in [2] for comparison. The same base cache sizes and configurations were used for all
simulations. The only modifications were which levels were phase adaptive and which were
not. These cache configurations are shown in Table 4.12. The CPU specific configuration
is shown in Table 4.13.
Parameter Value
General LRU, 64B line, 2 Read ports, 2 Write Ports
L1 Cache Split, 32KB, 4 way, 4 cycle
L2 Cache 256KB, 8 way, 5 cycle
L3 Cache 8MB, 16 way, 16 cycle
Table 4.12: Memory Configurations for All Simulations
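For reference, the number of sets and the capacity of a single way follow directly from
the sizes in Table 4.12. The short sketch below computes them using the standard
set-associative relation (size = sets x ways x line size). It treats the 32KB L1 figure as
the size of each split L1 cache, which is an assumption, and the names used are
illustrative.

```python
KB = 1024
MB = 1024 * KB
LINE_SIZE_BYTES = 64  # 64B line, from Table 4.12

# (total size in bytes, associativity) for each level, from Table 4.12.
CACHE_LEVELS = {"L1": (32 * KB, 4), "L2": (256 * KB, 8), "L3": (8 * MB, 16)}


def cache_geometry(size_bytes, ways, line_size=LINE_SIZE_BYTES):
    """Return (number of sets, bytes per way) for a set-associative cache."""
    sets = size_bytes // (ways * line_size)
    bytes_per_way = size_bytes // ways
    return sets, bytes_per_way


geometry = {
    level: cache_geometry(size, ways) for level, (size, ways) in CACHE_LEVELS.items()
}

for level, (sets, bytes_per_way) in geometry.items():
    print(f"{level}: {sets} sets, {bytes_per_way // KB} KB per way")
```

Since the partitions are defined in whole ways, the per-way capacity is also the
granularity at which capacity moves between the A and B partitions.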
Parameter               Value
Fetch Queue             64 bytes
Decode Width            4 instructions
Branch Predictor        Combined, 1024 entry BTB, 1024 entry Bimodal, Two level 8K history table
Return Address Stack    16 entry
Issue & Commit Width    4 instructions
Reorder Buffer Size     129
Table 4.13: Processor Configuration
The architecture used for the multicore simulations is shown in Figure 4.9. There are a total
of two cores and four threads running. The configuration for each of the individual cache
levels is kept the same as shown in Table 4.12.
4.3.2 Benchmarks
The SPEC2006 benchmark suite [6] was used for all of the simulations. These benchmarks
are listed in Table 4.14. For the multithreaded and multicore simulations, different
combinations of specific benchmarks were chosen. These combinations were chosen based on
the IPC performance of each of the benchmarks shown in Table 4.15. The combinations
include mixes of high performance benchmarks as well as mixes of benchmarks with low IPC
performance. Mixes are also created based on the Misses Per Kilo Instruction (MPKI) for
both cache level 1 (MPKIL1) and level 2 (MPKIL2). Each of these benchmarks is run for a
maximum number of cycles, chosen to allow enough activity to occur to show useful results.
Figure 4.9: Multiple Core Configuration [2]
4.4 Summary
This chapter describes the methods used to implement the design outlined in Chapter 3. It
first summarizes the previous SPICE simulations done in [10]. From there it describes how
CACTI is used to determine the hardware parameters for energy and latency, as well as
how Multi2sim is modified to simulate this design. The SPEC2006 benchmarks used and the
simulation configurations are also described here. The next chapter presents the results
of these simulations.
Test              Description
SPECINT
401.bzip2         Modified bzip2 to run in memory as opposed to I/O
403.gcc           Generates code based on GCC version 3.2
456.hmmer         Protein sequence analysis using Markov models
459.sjeng         An artificial intelligence program that plays chess
462.libquantum    Simulates a quantum computer running Shor’s factorization algorithm
471.omnetpp       Models a large ethernet network using OMNet++
SPECFP
433.milc          Quantum Chromodynamic simulator
434.zeusmp        Fluid dynamic simulation of astrophysical phenomena
436.cactusADM     Einstein evolution equation solver using staggered leapfrog method
447.dealII        Solves a Helmholtz-type equation with non-constant coefficients
450.soplex        Linear program simulator using a simplex algorithm and sparse linear algebra
454.calculix      Finite element code for linear and nonlinear 3D structures
465.tonto         Open source quantum chemistry package
470.lbm           Simulates incompressible fluids in 3D
Table 4.14: Spec2006 Benchmarks Used [6]
Benchmark IPC MPKIL1 MPKIL2
mcf 0.58 102 5
hmmer 1.19 4 0.001
milc 0.66 19 9
dealII 1.6 4 0.1
lbm 0.43 57 26
Table 4.15: Spec2006 Benchmark Performance Numbers
Chapter 5
Results
The drowsy phase adaptive cache is designed to reduce dynamic energy consumption and
reduce leakage power consumption while incurring a small performance penalty. These
aspects will be shown and analyzed for each cache level implemented for single threaded,
multithreaded, and multicore architectures.
5.1 Experiments
The following experiments, shown in Table 5.1, have been performed for each of the
benchmarks shown in Table 4.14 and will be shown in the following figures.
Configuration name   Description
Baseline             The cache configuration is held unpartitioned.
Phase                The cache configuration is determined based on the MRU statistics, energy, and leakage.
Drowsy               The cache configuration is determined based on the MRU statistics, energy, and leakage,
                     with the B partition being put into the drowsy state.
PhaseED              The cache configuration is determined based on the MRU statistics and the energy-delay
                     product.
DrowsyED             The cache configuration is determined based on the MRU statistics and the energy-delay
                     product, with the B partition being put into the drowsy state.
Table 5.1: Simulation Configurations and Descriptions
5.2 Single Threaded Results
The first set of results gathered is for single threaded simulations. First the simulations are
run with each cache level set to be phase adaptive independently. Finally, a simulation is
run with all three levels active together.
5.2.1 L1 Results
The following results are for a cache system in which only the L1 cache is phase adaptive.
5.2.1.1 Performance
Performance is one of the most important aspects of cache design and is generally used to
determine which designs are better than others. It is also easily quantifiable and therefore
a useful metric to judge a design. This section will first look at the time each simulation
spent in various configurations and then Speedup will be shown.
5.2.1.1.1 Configuration Time
The amount of time spent in each different configuration is very important to both power
and performance as shown previously with the latency and power values discussed for
each configuration. These configurations are shown in Table 4.2. Figure 5.1 shows the
percentage of time each simulation spent in each of the possible configurations.
For the Phase and PhaseED simulations, most of the benchmarks spend the majority of
their time in the first configuration, with the exception of hmmer. The first
configuration, C0, has the smallest A partition and largest B partition. In the Drowsy and
DrowsyED cases, more time is spent in the second configuration. In this case there is no
difference between the energy-delay experiments and the regular experiments because the
difference in delay cost between the configurations is minimal, as shown in Table 4.2.
(a) Phase Config Time (b) PhaseED Config Time
(c) Drowsy Config Time (d) DrowsyED Config Time
Figure 5.1: L1 Config Distributions
5.2.1.1.2 Speedup and IPC
Figure 5.2 shows the Speedup for each of the benchmarks run with a drowsy phase adaptive
L1 cache. This is one of the most important aspects of the results. There are already
other techniques that reduce energy at the expense of performance; therefore, if this
technique cannot achieve acceptable performance in relation to the baseline, then it
offers no advantage over them.
It can be seen from these results that in general the Phase and PhaseED simulations have
higher performance than the Drowsy and DrowsyED cases. This can be explained using
the configuration times discussed previously. Since Phase and PhaseED spend most of
their time in the first configuration they incur less of an overall performance hit. It can also
be seen that, in this case, all of the benchmarks perform better than the original baseline
simulation. This is due to the locality of accesses to L1.
Figure 5.2: Speedup of L1 Simulations
Due to the nature of L1 accesses, a very large majority of accesses are to MRU 0, as
shown by the MRU counter distributions for all the simulations in Figure 5.3. It can also
be seen that hmmer has the largest share of hits in other MRU states, which explains why
the configuration time for this case is different from the others. This MRU distribution
means that the majority of hits will occur in the A partition. This helps performance
because access to the B partition is not necessary.
(a) Phase MRU Distribution
(b) PhaseED MRU Distribution
(c) Drowsy MRU Distribution
(d) DrowsyED MRU Distribution
Figure 5.3: L1 MRU Distributions
5.2.1.2 Energy and Power
The goal of this work is to reduce energy and power consumption while having a low
performance impact. As shown previously, for L1, the drowsy phase adaptive cache actually
performs better than the Baseline case.
5.2.1.2.1 Dynamic Energy
The phase adaptive cache lowers dynamic energy by reducing the number of cache ways
that are being accessed at any given time. The total dynamic energy savings is shown in
Figure 5.4. All of the experiments show promising dynamic energy savings, ranging from
22% at the lowest to 48% at the highest. These savings are due to the configuration times
and the MRU distributions. Since the experiments spend all of their time in either the
first configuration or the second configuration, dynamic energy is saved with each access.
Since most of the accesses are to the A partition, the extra switching overhead is not
incurred, which allows for more savings.
Figure 5.4: Total Dynamic Energy Savings of L1 Simulations
5.2.1.2.2 Leakage Energy
Leakage energy is a large portion of the overall energy consumption in a processor;
therefore, leakage savings is an important aspect of this work. Figure 5.5 shows the total
leakage savings. There are leakage savings in both the Drowsy and DrowsyED cases, but they
are small, ranging from 4.5% to 16.4%, because of the small size of the L1 cache. The
leakage energy values for the different configurations do not vary heavily, as shown in
Table 4.8. This means that, even though leakage energy is saved with each access to the A
partition, only a small overall percentage is saved.
Figure 5.5: Total Leakage Energy of L1 Simulations
5.2.1.2.3 Total Energy
The total amount of energy savings is shown in Figure 5.6. This figure is very similar to
Figure 5.4 because the dynamic energy savings dominates the overall energy savings. This
is not surprising due to the small amount of difference in the leakage values for each of the
drowsy configurations. Overall the results presented here are as expected and show high
dynamic energy savings for L1.
Figure 5.6: Total Energy Savings of L1 Simulations
5.2.2 L2 Results
The following results are for a cache system in which only the L2 cache is phase adaptive.
5.2.2.1 Performance
5.2.2.1.1 Configuration Time
Figure 5.7 shows the percentage of time spent in each possible configuration. For the
Phase and PhaseED configurations there is a lot of configuration activity. The PhaseED
configuration has more activity in C3 than the Phase configuration in some cases. This
is due to the added consideration of the delay cost in the overall cost equation. For the
Drowsy configuration most of the configuration time is spent in the first configuration. The
DrowsyED configuration has more activity in configurations C1 and C2 than the Drowsy.
This is due to added delay cost in the DrowsyED configuration.
(a) Phase Config Time (b) PhaseED Config Time
(c) Drowsy Config Time (d) DrowsyED Config Time
Figure 5.7: L2 Config Distributions
5.2.2.1.2 Speedup and IPC
Figure 5.8 shows the Speedup for each of the benchmarks run with a drowsy phase adaptive
L2 cache. It can be seen from these results that, for L2, there is a performance hit in
most cases; however, this impact is very small. In the best case the performance is the
same as the baseline. In the worst case the performance is 97.9% of the baseline, which is
only a 2.1% performance hit.
Figure 5.8: Speedup of L2 Simulations
These performance results are not as good as L1. They can be explained by looking at
the MRU distribution for the L2 accesses. Figure 5.9 shows the MRU distribution for all
of the configurations. The MRU distribution shows a lot of activity in the first few MRU
states. Unlike in L1, where a large majority of the hits were in MRU 0, the hits are more
distributed for L2. In some of the cases the majority of hits are not in MRU 0. These are
the simulations with the lowest performance. This is why there is more variation in the
configuration times than in L1. There will be more hits to the B partition in these cases
which will incur a higher delay cost.
(a) Phase MRU Distribution
(b) PhaseED MRU Distribution
(c) Drowsy MRU Distribution
(d) DrowsyED MRU Distribution
Figure 5.9: L2 MRU Distributions
5.2.2.2 Energy and Power
5.2.2.2.1 Dynamic Energy
The total dynamic energy savings is shown in Figure 5.10. It can be seen from this figure
that the majority of the cases have reasonable energy savings. The Drowsy and DrowsyED
cases have the most energy savings due to the lower energy numbers. In a few cases the
PhaseED configurations do not have energy savings. This is due to the addition of the
delay cost to the overall cost equation. Overall the energy savings range from -3% to 35%.
The -3% can be accounted for by the occasional extra cost for access to the B partition.
The worst case should be equal to the Baseline configuration; however, since the
configurations are only updated every 15k instructions, it is possible to see a small
negative savings.
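To make the sign convention concrete, the sketch below computes a savings percentage
relative to the Baseline. The formula is an assumed reading of "savings in reference to
the Baseline simulation", and the energy values are hypothetical, chosen only to reproduce
the two ends of the reported range. A negative result means that the configuration
consumed more energy than the Baseline.

```python
def energy_savings_percent(baseline_energy, config_energy):
    """Energy saved relative to the baseline, as a percentage.

    Positive values mean the configuration used less energy than the baseline;
    negative values mean it used more.
    """
    return 100.0 * (baseline_energy - config_energy) / baseline_energy


# Hypothetical energies in arbitrary units.
baseline = 100.0
best_case = energy_savings_percent(baseline, 65.0)    # uses less energy
worst_case = energy_savings_percent(baseline, 103.0)  # uses slightly more

print(f"Best case: {best_case:.1f}%")
print(f"Worst case: {worst_case:.1f}%")
```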
Figure 5.10: Total Dynamic Energy Savings of L2 Simulations
5.2.2.2.2 Leakage Energy
Figure 5.11 shows the total leakage savings in reference to the Baseline simulation. There
is a promising amount of leakage savings shown for the Drowsy and DrowsyED cases,
with savings ranging from 43.9% to 45.3%. The leakage savings here is higher than in the
previous L1 case because the L2 cache is larger and therefore there is a larger difference
in the leakage energy for the drowsy configurations. These values are shown in Table 4.9.
Figure 5.11: Total Leakage Energy of L2 Simulations
5.2.2.2.3 Total Energy
The total amount of energy savings is shown in Figure 5.12. This figure is very similar to
Figure 5.11 because the leakage energy savings dominates the overall energy savings. This
is as expected since the leakage energy makes up a large portion of the overall energy
consumption. Overall the results presented here are as expected and show promising leakage
energy savings.
Figure 5.12: Total Energy Savings of L2 Simulations
5.2.3 L3 Results
The following results are for a cache system in which only the L3 cache is phase adaptive.
5.2.3.1 Performance
5.2.3.1.1 Configuration Time
Figure 5.13 shows the percentage of time spent in each possible configuration. For all of
the simulations there is a lot of configuration activity. The Drowsy simulations show the
most time spent in C0 as expected. The DrowsyED simulations show more variation than
the Drowsy simulations due to the addition of delay to the optimization equation. This is a
much bigger cache than L1 and L2 so more activity is expected.
(a) Phase Config Time (b) PhaseED Config Time
(c) Drowsy Config Time (d) DrowsyED Config Time
Figure 5.13: L3 Config Distributions
5.2.3.1.2 Speedup and IPC
Figure 5.14 shows the Speedup for each of the benchmarks run with a drowsy phase adaptive
L3 cache. It can be seen from these results that, for L3, the performance improves
for some cases and for others there is a low performance hit. In the best case the
performance is 103% of the baseline, while in the worst case the performance is 99.5%.
This is less than a 1% performance hit in some cases, while others show higher
performance.
Figure 5.14: Speedup of L3 Simulations
These performance results for L3 show an acceptable performance hit in some cases and a
performance improvement in other cases. They can be explained by looking at the MRU
distribution for the L3 accesses. Figure 5.15 shows the MRU distribution for all of the
configurations. The MRU distribution shows the majority of accesses are to MRU 0. The
performance results for L2 were not as high as for L3; however, the L2 MRU distribution
showed more activity in the higher MRU states than is seen here. This provides some
insight into why L3 performs better. Also, as stated previously, L3 is larger than L2;
therefore, for every hit in the A partition, there is a larger overall savings in
performance.
(a) Phase MRU Distribution
(b) PhaseED MRU Distribution
(c) Drowsy MRU Distribution
(d) DrowsyED MRU Distribution
Figure 5.15: L3 MRU Distributions
5.2.3.2 Energy and Power
5.2.3.2.1 Dynamic Energy
The total dynamic energy savings is shown in Figure 5.16. These results show that for
many cases there is significant energy savings. The savings range from 10% at the lowest
to 61.5% at the highest. In most cases the Drowsy case actually has the lowest dynamic
energy savings in comparison to the other cases. This can be explained by looking at the
configuration times and at the dynamic energy numbers for L3, which are shown in
Table 4.11. Based on the configurations for L3, shown in Table 4.6, the B partition for L3
remains fairly large. For this reason, the difference between the dynamic energy numbers
for the Phase and Drowsy cases is relatively small. The differences between the dynamic
energy values for the different sized B partitions are also small. For DrowsyED, once the
delay cost is added into the equation, C0 and the smaller configurations are more cost
effective.
Figure 5.16: Total Dynamic Energy Savings of L3 Simulations
5.2.3.2.2 Leakage Energy
Figure 5.17 shows the total leakage savings in reference to the Baseline simulation. There
is leakage savings in all cases; however, it is not as large as for L2. The savings for L3
range from 11.8% to 16.6%. These energy savings can be explained by the drowsy leakage
energy values. Since both L3 and its B partition are so large, the difference between
the Drowsy leakage values and the Phase leakage values is relatively small.
Figure 5.17: Total Leakage Energy of L3 Simulations
5.2.3.2.3 Total Energy
The total amount of energy savings is shown in Figure 5.18. This figure is similar to
Figure 5.17 because the leakage energy savings dominates the overall energy savings. This
is as expected since the leakage energy makes up a large portion of the overall energy
consumption.
Figure 5.18: Total Energy Savings of L3 Simulations
5.2.4 L1, L2, L3 Single Threaded
The following results are for a cache system with a drowsy phase adaptive L1, L2, and L3.
5.2.4.1 Performance
5.2.4.1.1 Configuration Time
Figure 5.19 shows the percentage of time spent in each possible configuration for the L1
cache level. Most of the configuration time is spent in either C0 or C1 due to the high
level of locality in L1 accesses. This is as expected and similar to the results shown
previously for just L1.
(a) L1 Phase Config Time (b) L1 PhaseED Config Time
(c) L1 Drowsy Config Time (d) L1 DrowsyED Config Time
Figure 5.19: L1 Config Distributions
Figure 5.20 shows the percentage of time spent in each possible configuration for the L2
cache level. There is a lot of configuration variation for the Phase and PhaseED
simulations. The Drowsy simulation spends most of its time in C0 to optimize for the most
energy savings, and the DrowsyED simulation spends time in the first few configurations to
optimize for both energy and delay. This is as expected and similar to the results shown
previously for just L2.
(a) L2 Phase Config Time (b) L2 PhaseED Config Time
(c) L2 Drowsy Config Time (d) L2 DrowsyED Config Time
Figure 5.20: L2 Config Distributions
Figure 5.21 shows the percentage of time spent in each possible configuration for the L3
cache level. There is a lot of configuration variation for the Phase and PhaseED
simulations. The Drowsy simulation spends most of its time in C0 to optimize for the most
energy savings, and the DrowsyED simulation spends time in the first few configurations to
optimize for both energy and delay. This is as expected and similar to the results shown
previously for just L3.
(a) L3 Phase Config Time (b) L3 PhaseED Config Time
(c) L3 Drowsy Config Time (d) L3 DrowsyED Config Time
Figure 5.21: L3 Config Distributions
5.2.4.1.2 Speedup and IPC
Figure 5.22 shows the Speedup for each of the benchmarks run with a drowsy phase adaptive
L1, L2, and L3 cache. It can be seen from these results that the performance improves
or is the same as the baseline for all cases. The overall performance ranges from 100% to
118%. This is as expected. Individually there is a large performance gain from having just
a phase adaptive L1. There is a very small performance hit for L2 and on average either
a gain or a small hit for L3. Overall, with all three cache levels working together, it is
expected that there would be a small performance gain.
Figure 5.22: Speedup
5.2.4.2 Energy and Power
5.2.4.2.1 Dynamic Energy
The total dynamic energy savings for L1 is shown in Figure 5.23. These results show that
for many cases there is significant energy savings. The total energy savings ranges from
13.9% at the lowest to 56.1% at the highest. The Drowsy and DrowsyED simulations show
the most dynamic energy savings as expected.
Figure 5.23: Total Dynamic Energy Savings of L1 Cache
The total dynamic energy savings for L2 is shown in Figure 5.24. These results show
savings for the Drowsy and DrowsyED cases but higher energy consumption than the baseline
for Phase and PhaseED cases. The energy savings ranges from -29.6% to 38.6%. This
differs greatly from the results that were seen when L2 is the only active cache level.
Figure 5.24: Total Dynamic Energy Savings of L2 Cache
These results can be explained by looking at the cost equations. Equation 3.1 shows that one
factor in the overall cost function is the cost of a miss. Here the cost of a miss in L2
is the cost of a single read to L3. In this work the worst-case scenario is used to calculate
the cost of a single read to L3. When L3 is phase adaptive this cost can be represented by
Equation 5.1.
ReadCost = costA + costB (5.1)
The cost in this equation is the dynamic energy. Table 5.2 shows the access costs for L3 for
each of the different configurations. The partition costs used in this table can be found in
Table 4.11. These results show that for L3 all of the other configurations have a higher cost
than C4, the configuration that matches the baseline case. This means that, whenever L3 is in
any configuration other than C4, the miss cost for L2 is higher than it is for the baseline
case. L2 can therefore consume more total dynamic energy than the baseline configuration. In
the Drowsy and DrowsyED configurations the lower energy values for both L2 and L3 are enough
to overcome this, resulting in positive energy savings.
Configuration   Cost A partition (nJ)   Cost B partition (nJ)   Total Cost (nJ)
C0              0.435                   2.1808                  2.6151
C1              0.6208                  2.1172                  2.738
C2              0.9642                  1.9618                  2.926
C3              1.5382                  1.5382                  3.0764
C4              2.235                   0                       2.235
Table 5.2: Access Costs for L3 Configurations
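Equation 5.1 can be applied directly to the partition costs in Table 5.2. The short Python sketch below recomputes each total and its difference from the baseline configuration C4. The variable and function names are illustrative. Note that the recomputed C0 sum is 2.6158 nJ, which differs from the tabulated 2.6151 nJ in the last digits, possibly because the listed partition costs are rounded.

```python
# Partition costs (nJ) copied from Table 5.2: (cost A, cost B).
L3_PARTITION_COST_NJ = {
    "C0": (0.435, 2.1808),
    "C1": (0.6208, 2.1172),
    "C2": (0.9642, 1.9618),
    "C3": (1.5382, 1.5382),
    "C4": (2.235, 0.0),
}
BASELINE_CONFIG = "C4"


def read_cost(cost_a, cost_b):
    """Equation 5.1: worst-case read cost is the sum of both partition costs."""
    return cost_a + cost_b


totals = {cfg: read_cost(a, b) for cfg, (a, b) in L3_PARTITION_COST_NJ.items()}

# Extra energy (nJ) an L2 miss pays relative to the baseline configuration.
extra_vs_baseline = {
    cfg: total - totals[BASELINE_CONFIG] for cfg, total in totals.items()
}

for cfg in totals:
    print(f"{cfg}: total={totals[cfg]:.4f} nJ, extra={extra_vs_baseline[cfg]:+.4f} nJ")
```

Every configuration other than C4 yields a positive extra cost, which is the condition under which the L2 miss cost exceeds the baseline.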
The total dynamic energy savings for L3 is shown in Figure 5.25. These results show that
for many cases there is significant energy savings. The total energy savings ranges from
10% at the lowest to 63.1% at the highest. These results are as expected and very similar
to the results stated previously for the case when just L3 is active.
Figure 5.25: Total Dynamic Energy Savings of L3 Cache
5.2.4.2.2 Leakage Energy
The total leakage energy savings for L1 is shown in Figure 5.26. These results show that
there is leakage savings for all cases, however this savings is small ranging from 4.3% to
18.1%. This is as expected and consistent with the L1 results shown previously.
Figure 5.26: Total Leakage Energy of L1 Cache
The total leakage energy savings for L2 is shown in Figure 5.27. These results show that
there is a large amount of leakage energy savings for L2, ranging from 44.8% to 53.6%. This
is as expected and consistent with the L2 results shown previously.
Figure 5.27: Total Leakage Energy of L2 Cache
Figure 5.28 shows the total leakage savings in reference to the Baseline simulation for L3.
The savings for L3 ranges from 16% to 30.8%. This leakage savings is larger than L1 and
smaller than L2. This is as expected and shows a similar trend as presented in the previous
results for L3 running independently.
Figure 5.28: Total Leakage Energy of L3 Cache
5.2.4.2.3 Total Energy
The total amount of energy savings for L1 is shown in Figure 5.29. This figure is very
similar to Figure 5.23 because the dynamic energy savings dominates the overall energy
savings; both the leakage energy and the dynamic energy of L1 are very small. This is as
expected and consistent with previous results.
Figure 5.29: Total Energy Savings of L1 Cache
The total amount of energy savings for L2 is shown in Figure 5.30. This figure is very
similar to Figure 5.27 because the leakage energy savings dominates the overall energy
savings for L2. This is also expected and consistent with previous results.
Figure 5.30: Total Energy Savings of L2 Cache
The total amount of energy savings for L3 is shown in Figure 5.31. This result combines
the results in Figure 5.25 and Figure 5.28 because the total energy savings for L3 is shared
between both dynamic and leakage, with the leakage energy dominating overall. This is
also expected and consistent with previous results.
Figure 5.31: Total Energy Savings of L3 Cache
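The way the total savings track the dominant energy component can be illustrated with a small Python sketch. If savings are measured relative to the baseline, the total savings are the average of the dynamic and leakage savings weighted by each component's baseline energy. The baseline energies and savings percentages below are hypothetical and chosen only to show the effect.

```python
def total_savings_percent(dyn_base, leak_base, dyn_savings_pct, leak_savings_pct):
    """Combine per-component savings into total savings.

    dyn_base, leak_base: baseline dynamic and leakage energy (same units).
    dyn_savings_pct, leak_savings_pct: savings of each component in percent.
    """
    total_base = dyn_base + leak_base
    if total_base <= 0:
        raise ValueError("baseline energy must be positive")
    saved = dyn_base * dyn_savings_pct + leak_base * leak_savings_pct
    return saved / total_base


# Hypothetical leakage-dominated cache: leakage is 9x the dynamic energy.
# A 20% dynamic energy increase barely dents a 50% leakage saving.
leakage_dominated = total_savings_percent(1.0, 9.0, -20.0, 50.0)
```

In this example the total works out to 43%, close to the leakage figure, which mirrors the reasoning given for the larger cache levels.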
5.3 Multithreaded Results
The second set of results gathered is for multithreaded simulations. First the simulations
are run with each cache level being phase adaptive independently. These results are shown
in Appendix A. Then there are some additional simulations with different combinations of
cache levels set to be phase adaptive. Finally a simulation is run with all three levels active
together. These simulations include simulations with two threads and four threads.
5.3.1 L1 and L2 Multithreaded Results
The following set of results is for a multithreaded cache system with a phase adaptive L1
and L2 cache.
5.3.1.1 Performance
5.3.1.1.1 Configuration Time
Figure 5.32 shows the percentage of time spent in each possible configuration for L1. For
the Phase and PhaseED configurations there is time spent in all of the configurations. For
the Drowsy and DrowsyED configurations most of the configuration time is spent in the first
two configurations. These configuration times are as expected and match the results shown
for the multithreaded cases with just a phase adaptive L1.
(a) L1 Phase Config Time (b) L1 PhaseED Config Time
(c) L1 Drowsy Config Time (d) L1 DrowsyED Config Time
Figure 5.32: L1 Config Distributions
Figure 5.33 shows the percentage of time spent in each possible configuration for L2. For
the Phase and PhaseED configurations there is a lot of time spent in the C2 configuration.
For the Drowsy configuration most of the configuration time is spent in the first
configuration. The DrowsyED configuration has more activity in configurations C1 and C2 than
the Drowsy configuration. This is due to the added delay cost in the DrowsyED configuration.
These configuration times are as expected and match the results shown for the multithreaded
cases with just a phase adaptive L2.
(a) L2 Phase Config Time (b) L2 PhaseED Config Time
(c) L2 Drowsy Config Time (d) L2 DrowsyED Config Time
Figure 5.33: L2 Config Distributions
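One way to see why adding delay to the cost moves DrowsyED toward C1 and C2 is to compare an energy-only selection with a selection that also weighs delay. The Python sketch below assumes an energy-delay product as the combined cost. This is an illustrative stand-in and not Equation 3.1, and the per-configuration energy and delay values are hypothetical.

```python
# Hypothetical normalized (energy, delay) per configuration; not measured values.
CONFIGS = {
    "C0": (1.0, 1.30),
    "C1": (1.1, 1.10),
    "C2": (1.3, 1.00),
}


def best_config(configs, cost):
    """Return the name of the configuration with the lowest cost."""
    return min(configs, key=lambda name: cost(*configs[name]))


def energy_only(energy, delay):
    """Cost used when optimizing for energy alone."""
    return energy


def energy_delay(energy, delay):
    """Assumed combined metric: the energy-delay product."""
    return energy * delay


drowsy_choice = best_config(CONFIGS, energy_only)
drowsy_ed_choice = best_config(CONFIGS, energy_delay)
```

With these made-up numbers the energy-only selection picks C0, while the energy-delay selection picks C1 because the extra delay of C0 outweighs its lower energy.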
5.3.1.1.2 Speedup and IPC
Figure 5.34 shows the Speedup for each of the benchmarks run with a phase adaptive L1
and L2 cache. It can be seen from these results that in all cases the performance is better
than the baseline. The performance ranges from 101% to 105%. In general the Phase and
PhaseED simulations have higher speedup than the Drowsy and DrowsyED simulations.
This is as expected due to the added performance hit incurred when the cache is set to
drowsy. These results are as expected considering that both L1 and L2 are phase adaptive.
The speedup is slightly smaller than the speedup when just L1 is phase adaptive. This is
due to the slight performance hit incurred by setting L2 to be phase adaptive.
Figure 5.34: Speedup
5.3.1.2 Energy and Power
5.3.1.2.1 Dynamic Energy
The total dynamic energy savings for L1 is shown in Figure 5.35. These results show that
there is dynamic energy savings for all of the simulations. The savings ranges from 4.0%
to 50.5% for the simulations with two threads. It ranges from 59.7% to 73.4% for the
simulations with four threads. In most cases the Drowsy and DrowsyED cases have higher
dynamic energy savings. This is because of the slightly smaller energy values. These
results are as expected for L1.
Figure 5.35: Total Dynamic Energy Savings of L1 Cache
The total dynamic energy savings for L2 is shown in Figure 5.36. It can be seen from these
figures that the simulations with two threads have small energy savings while the simulation
with four threads has high energy savings. The energy savings for the simulations with
two threads range from -1.4% to 13.3%, while the simulation with four threads has savings
around 46.2%. In general the Drowsy and DrowsyED simulations have the most energy
savings, as expected. These results are as expected for L2 and match up with previous
results.
Figure 5.36: Total Dynamic Energy Savings of L2 Cache
5.3.1.2.2 Leakage Energy
Figure 5.37 shows the total leakage savings for L1. There is leakage savings in both the
Drowsy and DrowsyED cases. This savings is very small ranging from 2.1% to 7.5% for
the simulations with two threads. The savings is much higher for the simulation with four
threads; around 56% on average. The leakage savings for L1 is consistent with the savings
shown before for simulations with just a phase adaptive L1.
Figure 5.37: Total Leakage Energy of L1 Cache
Figure 5.38 shows the total leakage savings for L2 in reference to the Baseline simulation.
There is a promising amount of leakage savings shown for the Drowsy and DrowsyED
cases, with savings ranging from 44.5% to 45.3% for the cases with two threads. For the
case with four threads there is a savings of around 73%. These results are as expected and
match the results for L2 when it is the only phase adaptive cache level.
Figure 5.38: Total Leakage Energy of L2 Cache
5.3.1.2.3 Total Energy
The total amount of energy savings for L1 is shown in Figure 5.39. The total amount of
energy savings for L2 is shown in Figure 5.40. Both of these results show that leakage
power dominates the overall energy savings for both cache levels. Overall the results with
both a phase adaptive L1 and L2 are as expected. The overall savings for each cache level
remains very similar to the results for each level set as phase adaptive independently.
Figure 5.39: Total Energy Savings of L1 Cache
Figure 5.40: Total Energy Savings of L2 Cache
5.3.2 L1 and L3 Multithreaded Results
The following set of results is for a multithreaded cache system with a phase adaptive L1
and L3 cache.
5.3.2.1 Performance
5.3.2.1.1 Configuration Time
Figure 5.41 shows the percentage of time spent in each possible configuration for L1. For
the Phase and PhaseED configurations there is time spent in all of the configurations. For
the Drowsy and DrowsyED configurations most of the configuration time is spent in the first
two configurations. These configuration times are as expected and match the results shown
for the multithreaded cases with just a phase adaptive L1.
(a) L1 Phase Config Time (b) L1 PhaseED Config Time
(c) L1 Drowsy Config Time (d) L1 DrowsyED Config Time
Figure 5.41: L1 Config Distributions
Figure 5.42 shows the percentage of time spent in each possible configuration for L3. For
all of the simulations there is a lot of configuration activity. The Drowsy simulations show
the most time spent in C0, as expected. The DrowsyED simulations show more varia-
tion than the Drowsy simulations due to the addition of delay to the optimization equa-
tion. These configuration times are as expected and match the results shown for the multi-
threaded cases with just a phase adaptive L3.
(a) L3 Phase Config Time (b) L3 PhaseED Config Time
(c) L3 Drowsy Config Time (d) L3 DrowsyED Config Time
Figure 5.42: L3 Config Distributions
5.3.2.1.2 Speedup and IPC
Figure 5.43 shows the speedup for each of the benchmarks run with a phase adaptive L1 and
L3 cache. It can be seen from these results that the performance is generally better than
the baseline, although the lowest result is slightly below it. The performance ranges from
98.5% to 107.9%. These results are as expected considering that both L1 and L3 are phase
adaptive. The speedup is slightly smaller than the speedup when just L1 is phase adaptive.
This is due to the slight performance hit incurred by setting L3 to be phase adaptive.
Figure 5.43: Speedup
5.3.2.2 Energy and Power
5.3.2.2.1 Dynamic Energy
The total dynamic energy savings for L1 is shown in Figure 5.44. These results show that
there is dynamic energy savings for all of the simulations. The savings ranges from 12.2%
to 19% for the simulations with two threads. It is about 62% for the simulations with
four threads. In most cases the Drowsy and DrowsyED cases have higher dynamic energy
savings. This is because of the slightly smaller energy values. These results are as expected
for L1 and match the individual L1 results for multithreaded simulations.
Figure 5.44: Total Dynamic Energy Savings of L1 Cache
The total dynamic energy savings for L3 is shown in Figure 5.45. It can be seen from these
figures that the simulations with two threads have small energy savings while the simulation
with four threads has high energy savings. The energy savings for the simulations with
two threads range from -19.8% to 27.9% while the simulation with four threads has savings
ranging from 10% to 88%. The mcf-hmmer simulation has seemingly large negative energy
savings; however, this simulation has very low dynamic energy. Its dynamic energy consumption
is only slightly higher than the baseline, but the increase accounts for a larger percentage
because the absolute values are so small. These results are very similar to the results
previously gathered for an L3 phase adaptive cache running independently.
Figure 5.45: Total Dynamic Energy Savings of L3 Cache
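The effect described for mcf-hmmer, where a small absolute increase shows up as a large negative percentage, follows from how relative savings are computed. The Python sketch below uses hypothetical energy values (arbitrary units) to show the same absolute increase against a small and a large baseline.

```python
def savings_percent(baseline_energy, adaptive_energy):
    """Energy savings relative to the baseline, in percent.

    Negative values mean the adaptive cache consumed more than the baseline.
    """
    if baseline_energy <= 0:
        raise ValueError("baseline_energy must be positive")
    return 100.0 * (baseline_energy - adaptive_energy) / baseline_energy


# Hypothetical values: both cases consume 0.02 units more than the baseline.
low_energy_case = savings_percent(0.10, 0.12)     # small baseline
high_energy_case = savings_percent(10.00, 10.02)  # large baseline
```

The same 0.02 unit increase is roughly -20% against the small baseline and roughly -0.2% against the large one.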
5.3.2.2.2 Leakage Energy
Figure 5.46 shows the total leakage savings for L1. There is leakage savings in both the
Drowsy and DrowsyED cases. This savings is very small ranging from 1.9% to 9.3% for
the simulations with two threads. The savings is much higher for the simulation with four
threads; around 56% on average. The leakage savings for L1 is consistent with the savings
shown for simulations with just L1 being active.
Figure 5.46: Total Leakage Energy of L1 Cache
Figure 5.47 shows the total leakage savings for L3 in reference to the Baseline simulation.
There is a small amount of leakage savings shown for the Drowsy and DrowsyED cases,
with savings ranging from 8.8% to 16.3% for the cases with two threads. For the case with
four threads there is a savings of around 58%. These results are as expected and match the
results for L3 when it was the only phase adaptive cache level.
Figure 5.47: Total Leakage Energy of L3 Cache
5.3.2.2.3 Total Energy
The total amount of energy savings for L1 is shown in Figure 5.48. The total amount of
energy savings for L3 is shown in Figure 5.49. The results for L3 show that leakage power
dominates the overall energy savings while for L1 both dynamic and leakage energy play
a significant role. This is due to the small size of L1 and the therefore small energy values
for both leakage energy and dynamic energy. Overall the results with both a phase adaptive
L1 and L3 are as expected. The overall savings for each cache level remains very similar
to the results for each level independently.
Figure 5.48: Total Energy Savings of L1 Cache
Figure 5.49: Total Energy Savings of L3 Cache
5.3.3 L2 and L3 Multithreaded Results
The following set of results is for a multithreaded cache system with a phase adaptive L2
and L3 cache.
5.3.3.1 Performance
5.3.3.1.1 Configuration Time
Figure 5.50 shows the percentage of time spent in each possible configuration for L2. For
the Phase and PhaseED configurations there is a lot of time spent in the C2 configuration.
For the Drowsy configuration most of the configuration time is spent in the first configura-
tion. The DrowsyED configuration has more activity in configurations C1 and C2 than the
Drowsy. This is due to added delay cost in the DrowsyED configuration. These configu-
ration times are as expected and match the results shown for the multithreaded cases with
just a phase adaptive L2.
(a) L2 Phase Config Time (b) L2 PhaseED Config Time
(c) L2 Drowsy Config Time (d) L2 DrowsyED Config Time
Figure 5.50: L2 Config Distributions
Figure 5.51 shows the percentage of time spent in each possible configuration for L3. For
all of the simulations there is a lot of configuration activity. The Drowsy simulations show
the most time spent in C0 as expected. The DrowsyED simulations show more varia-
tion than the Drowsy simulations due to the addition of delay to the optimization equa-
tion. These configuration times are as expected and match the results shown for the multi-
threaded cases with just a phase adaptive L3.
(a) L3 Phase Config Time (b) L3 PhaseED Config Time
(c) L3 Drowsy Config Time (d) L3 DrowsyED Config Time
Figure 5.51: L3 Config Distributions
5.3.3.1.2 Speedup and IPC
Figure 5.52 shows the speedup for each of the benchmarks run with a phase adaptive L2
and L3 cache. It can be seen from these results that there is a very small performance hit for
each simulation. The performance ranges from 96.9% in the worst case to 99.9% in the
best case. These results are as expected considering that both L2 and L3 are phase adaptive.
The speedup is slightly smaller than the speedup with just L3 in some of the cases; this
is due to the added performance hit incurred when L2 is phase adaptive.
Figure 5.52: Speedup
5.3.3.2 Energy and Power
5.3.3.2.1 Dynamic Energy
The total dynamic energy savings for L2 is shown in Figure 5.53. It can be seen from these
figures that the simulations with two threads have small energy savings while the simulation
with four threads has higher energy savings. The energy savings for the simulations with
two threads range from -12.3% to 25.3% while the simulation with four threads has savings
around 46.2%. In general the Drowsy and DrowsyED simulations have the most energy
savings, as expected. There are negative results for the Phase and PhaseED cases. This is
due to the additional miss cost for L2 when L3 is phase adaptive, as described previously
in the single threaded results where both L2 and L3 are phase adaptive. The same cost
equations and energy numbers apply here as well, which accounts for the additional energy
consumption. These results are as expected for L2 and match up with previous results.
Figure 5.53: Total Dynamic Energy Savings of L2 Cache
The total dynamic energy savings for L3 is shown in Figure 5.54. It can be seen from these
figures that the simulations with two threads have small energy savings while the simulation
with four threads has high energy savings. The energy savings for the simulations with
two threads range from -7.5% to 34.7% while the simulation with four threads has savings
ranging from 9% to 54%. These results are very similar to the results previously gathered
for an L3 phase adaptive cache running independently.
Figure 5.54: Total Dynamic Energy Savings of L3 Cache
5.3.3.2.2 Leakage Energy
Figure 5.55 shows the total leakage savings for L2. There is leakage savings in both the
Drowsy and DrowsyED cases. This savings is very promising, ranging from 44.6% to
45.3% for the simulations with two threads and around 72% for the simulations with four
threads. The leakage savings for L2 is consistent with the savings shown before for simu-
lations with a phase adaptive L2 cache.
Figure 5.55: Total Leakage Energy of L2 Cache
Figure 5.56 shows the total leakage savings for L3 in reference to the Baseline simulation.
There is a small amount of leakage savings shown for the Drowsy and DrowsyED cases,
with savings ranging from 9.0% to 16.3% for the cases with two threads. For the case with
four threads there is a savings of around 58%. These results are as expected and match the
results for L3 when it was the only phase adaptive cache level.
Figure 5.56: Total Leakage Energy of L3 Cache
5.3.3.2.3 Total Energy
The total amount of energy savings for L2 is shown in Figure 5.57. The total amount of
energy savings for L3 is shown in Figure 5.58. Both of these results show that leakage
power dominates the overall energy savings for both cache levels. Overall the results with
both a phase adaptive L2 and L3 are as expected. The overall savings for each cache level
remains very similar to the results for each level independently.
Figure 5.57: Total Energy Savings of L2 Cache
Figure 5.58: Total Energy Savings of L3 Cache
5.3.4 L1, L2, L3 Multithreaded Results
The following set of results is for a multithreaded cache system with a phase adaptive
L1, L2, and L3 cache. Additional simulations with four threads are added in this case for
further analysis. Table 5.3 shows the names for these simulations seen in the following
figures and the corresponding benchmarks used.
Name    Thread 1   Thread 2   Thread 3   Thread 4
four    lbm        milc       hmmer      dealII
four2   lbm        lbm        lbm        lbm
four3   lbm        dealII     lbm        dealII
four4   lbm        lbm        milc       milc
Table 5.3: Experiments with Four Threads
5.3.4.1 Performance
5.3.4.1.1 Configuration Time
Figure 5.59 shows the percentage of time spent in each possible configuration for the L1
cache level. These results are as expected. Most of the configuration time is spent in either
C0 or C1 due to the high level of locality in L1 accesses. This is as expected and similar to
the results shown previously for just L1 in both the multithreaded and single threaded cases.
(a) L1 Phase Config Time (b) L1 PhaseED Config Time
(c) L1 Drowsy Config Time (d) L1 DrowsyED Config Time
Figure 5.59: L1 Config Distributions
Figure 5.60 shows the percentage of time spent in each possible configuration for the L2
cache level. There is a lot of configuration variation for the Phase and PhaseED simu-
lations. The Drowsy simulation spends most of its time in C0 to optimize for the most
energy savings and the DrowsyED simulation spends time in the first few configurations to
optimize for both energy and delay. This is as expected and similar to the results shown
previously for just L2 in both the multithreaded and single threaded cases.
(a) L2 Phase Config Time (b) L2 PhaseED Config Time
(c) L2 Drowsy Config Time (d) L2 DrowsyED Config Time
Figure 5.60: L2 Config Distributions
Figure 5.61 shows the percentage of time spent in each possible configuration for the L3
cache level. There is a lot of configuration variation for the Phase and PhaseED simu-
lations. The Drowsy simulation spends most of its time in C0 to optimize for the most
energy savings and the DrowsyED simulation spends time in the first few configurations to
optimize for both energy and delay. This is as expected and similar to the results shown
previously for just L3 in both the multithreaded and single threaded cases.
(a) L3 Phase Config Time (b) L3 PhaseED Config Time
(c) L3 Drowsy Config Time (d) L3 DrowsyED Config Time
Figure 5.61: L3 Config Distributions
5.3.4.1.2 Speedup and IPC
Figure 5.62 shows the Speedup for each of the benchmarks run with a drowsy phase adap-
tive L1, L2, and L3 cache. It can be seen from these results that the performance improves
in some cases and incurs a small penalty in others. The speedup ranges from 95.2% to
108.1%. This is as expected. Individually there is a large performance gain from having
just L1 be phase adaptive. There was a very small performance hit for L2 and on average
either a gain or a small hit for L3. Overall, with all three cache levels working together, it
is expected that there would be a mix of a small performance gain and a small performance
hit.
Figure 5.62: Speedup
5.3.4.2 Energy and Power
5.3.4.2.1 Dynamic Energy
The total dynamic energy savings for L1 is shown in Figure 5.63. These results show
that for many cases there is significant energy savings. The total energy savings ranges
from -43.4% to 58.7%. The Drowsy and DrowsyED simulations show the most dynamic
energy savings as expected. One of the simulations, four2, has a rather large negative
energy savings for the Phase and PhaseED cases. This can be explained by looking at the
L1 miss cost. Just as previously with L2, the miss cost for L1 is the cost of an L2 access.
Again, just as described previously for L2 and L3, this cost for L2 in the Phase and PhaseED
cases is higher for all phase configurations that are not equal to the baseline. This allows
for the possibility of higher energy consumption in some simulations. The lower energy
values for L2 in the Drowsy and DrowsyED cases overcome this possibility.
Figure 5.63: Total Dynamic Energy Savings of L1 Cache
The total dynamic energy savings for L2 is shown in Figure 5.64. These results show
savings for the Drowsy and DrowsyED cases but higher energy consumption than the baseline
for Phase and PhaseED cases. The energy savings ranges from -18.3% to 46.2%. This
matches the results shown previously for L2, when L3 is also phase adaptive.
Figure 5.64: Total Dynamic Energy Savings of L2 Cache
The total dynamic energy savings for L3 is shown in Figure 5.65. These results show that
for many cases there is significant energy savings. The total energy savings ranges from
-15% at the lowest to 54% at the highest. These results are as expected and very similar to
the results stated previously for the multithreaded case when just L3 is active.
Figure 5.65: Total Dynamic Energy Savings of L3 Cache
5.3.4.2.2 Leakage Energy
The total leakage energy savings for L1 is shown in Figure 5.66. These results show that
there is leakage savings for all cases however this savings is generally small ranging from
4.3% to 56.4%. This is as expected and consistent with the L1 results shown previously.
Figure 5.66: Total Leakage Energy of L1 Cache
The total leakage energy savings for L2 is shown in Figure 5.67. These results show that
there is a large amount of leakage energy savings for L2, ranging from 44.5% to 69.7%. This
is as expected and consistent with the L2 results shown previously.
Figure 5.67: Total Leakage Energy of L2 Cache
Figure 5.68 shows the total leakage savings in reference to the Baseline simulation for L3.
The savings for L3 ranges from 14.9% to 61.2%. This leakage savings is as expected and
matches previously presented results for L3.
Figure 5.68: Total Leakage Energy of L3 Cache
5.3.4.2.3 Total Energy
The total amount of energy savings for L1 is shown in Figure 5.69. This figure is very
similar to Figure 5.63 because the dynamic energy savings dominates the overall energy
savings. This is as expected and consistent with previous results.
Figure 5.69: Total Energy Savings of L1 Cache
The total amount of energy savings for L2 is shown in Figure 5.70. This figure is very
similar to Figure 5.67 because the leakage energy savings dominates the overall energy
savings for L2. This is also expected and consistent with previous results.
Figure 5.70: Total Energy Savings of L2 Cache
The total amount of energy savings for L3 is shown in Figure 5.71. This result combines the
results in Figure 5.65 and Figure 5.68 because the total energy savings for L3 is shared
between both dynamic and leakage energy. This is also expected and consistent with previous
results.
Figure 5.71: Total Energy Savings of L3 Cache
5.4 Multicore Results
The last set of results gathered for the drowsy phase adaptive cache design is for a Multicore
architecture. Table 5.4 shows the benchmarks used for each simulation and what is running
on each core.
Name         Core 1 Thread 1   Core 1 Thread 2   Core 2 Thread 1   Core 2 Thread 2
lbmAll       lbm               lbm               lbm               lbm
lbm-dealII   lbm               lbm               dealII            dealII
lbm-milc     lbm               lbm               milc              milc
All          mcf               hmmer             lbm               dealII
Table 5.4: Multicore Experiments
5.4.0.3 Performance
5.4.0.3.1 Configuration Time
Figure 5.72 shows the percentage of time spent in each possible configuration for the L1-1
cache level. This notation will be used to denote cache level L1 for core 1. Most of
the configuration time is spent in either C0 or C1 due to the high level of locality in L1
accesses. This is as expected and similar to the results shown previously for L1.
(a) L1-1 Phase Config Time (b) L1-1 PhaseED Config Time
(c) L1-1 Drowsy Config Time (d) L1-1 DrowsyED Config Time
Figure 5.72: L1-1 Config Distributions
Figure 5.73 shows the percentage of time spent in each possible configuration for the L1-2
cache level. Most of the configuration time is spent in either C0 or C1 due to the high level
of locality in L1 accesses. This is as expected with slight variation from the L1-1 results
shown previously.
(a) L1-2 Phase Config Time (b) L1-2 PhaseED Config Time
(c) L1-2 Drowsy Config Time (d) L1-2 DrowsyED Config Time
Figure 5.73: L1-2 Config Distributions
Figure 5.74 shows the percentage of time spent in each possible configuration for the L2-1
cache level. There is a lot of configuration variation for the Phase and PhaseED simu-
lations. The Drowsy simulation spends most of its time in C0 to optimize for the most
energy savings and the DrowsyED simulation spends time in the first few configurations to
optimize for both energy and delay. This is as expected and similar to the results shown
previously for just L2 in both the multithreaded and single threaded cases.
(a) L2-1 Phase Config Time (b) L2-1 PhaseED Config Time
(c) L2-1 Drowsy Config Time (d) L2-1 DrowsyED Config Time
Figure 5.74: L2-1 Config Distributions
Figure 5.75 shows the percentage of time spent in each possible configuration for the L2-2
cache level. These results are very similar to the L2-1 cases but do vary slightly. There
is a lot of configuration variation for the Phase and PhaseED simulations. The Drowsy
simulation spends most of its time in C0 and the DrowsyED simulation spends time in the
first few configurations.
(a) L2-2 Phase Config Time (b) L2-2 PhaseED Config Time
(c) L2-2 Drowsy Config Time (d) L2-2 DrowsyED Config Time
Figure 5.75: L2-2 Config Distributions
Figure 5.76 shows the percentage of time spent in each possible configuration for the L3
cache level. There is a lot of configuration variation for the Phase and PhaseED simu-
lations. The Drowsy simulation spends most of its time in C0 to optimize for the most
energy savings and the DrowsyED simulation spends time in the first few configurations to
optimize for both energy and delay. This is as expected and similar to the results shown
previously for just L3 in both the multithreaded and single threaded cases.
(a) L3 Phase Config Time (b) L3 PhaseED Config Time
(c) L3 Drowsy Config Time (d) L3 DrowsyED Config Time
Figure 5.76: L3 Config Distributions
5.4.0.3.2 Speedup and IPC
Figure 5.77 shows the Speedup for each of the simulations. It can be seen from these
results that the performance improves in some cases and incurs a small penalty in others.
The speedup ranges from 98.0% to 117.1%. This is as expected and matches results we
have seen previously when all three cache levels are active and multiple threads are running.
Figure 5.77: Speedup
5.4.0.4 Energy and Power
5.4.0.4.1 Dynamic Energy
The total dynamic energy savings for L1-1 is shown in Figure 5.78. These results show that
for many cases there is significant energy savings. The total energy savings ranges from
22.5% to 72.9%. The Drowsy and DrowsyED simulations show the most dynamic energy
savings as expected.
Figure 5.78: Total Dynamic Energy Savings of L1-1 Cache
The total dynamic energy savings for L1-2 is shown in Figure 5.79. These results show that
for many cases there is significant energy savings. The total energy savings ranges from
23.2% to 73.0%. These results are very similar to the energy results for L1-1.
Figure 5.79: Total Dynamic Energy Savings of L1-2 Cache
The total dynamic energy savings for L2-1 is shown in Figure 5.80. These results show
energy savings for the Drowsy and DrowsyED cases but higher energy consumption than the
baseline for Phase and PhaseED cases. The energy savings ranges from -30.9% to 59.9%.
This matches the results shown previously for L2, when L3 is also phase adaptive.
Figure 5.80: Total Dynamic Energy Savings of L2-1 Cache
The total dynamic energy savings for L2-2 is shown in Figure 5.81. These results show
energy savings for the Drowsy and DrowsyED cases but higher energy consumption than the
baseline for Phase and PhaseED cases. The energy savings ranges from -28.2% to 30.4%.
These results are similar to the results for L2-1 with only slight expected variation.
Figure 5.81: Total Dynamic Energy Savings of L2-2 Cache
The total dynamic energy savings for L3 is shown in Figure 5.82. Many cases show
significant energy savings. The savings ranges from -10.2% at the lowest to 44.8% at the
highest. These results are as expected and very similar to the results presented previously
for the multithreaded case when L3 is active.
Figure 5.82: Total Dynamic Energy Savings of L3 Cache
5.4.0.4.2 Leakage Energy
The total leakage energy savings for L1-1 is shown in Figure 5.83. These results show that
there is leakage savings for all cases ranging from 7.4% to 56.4%. This is as expected and
consistent with the L1 results shown previously.
Figure 5.83: Total Leakage Energy of L1-1 Cache
The total leakage energy savings for L1-2 is shown in Figure 5.84. These results show that
there is leakage savings for all cases, ranging from 8% to 58%. This is as expected and very
similar to the leakage results for L1-1.
Figure 5.84: Total Leakage Energy of L1-2 Cache
The total leakage energy savings for L2-1 is shown in Figure 5.85. These results show that
there is a large amount of leakage energy savings for L2-1, ranging from 44.7% to 75.3%.
This is as expected and consistent with the L2 results shown previously.
Figure 5.85: Total Leakage Energy of L2-1 Cache
The total leakage energy savings for L2-2 is shown in Figure 5.86. These results show that
there is a large amount of leakage energy savings for L2-2, ranging from 44.7% to 75.3%.
These results are as expected and very similar to the leakage results shown previously for
L2-1.
Figure 5.86: Total Leakage Energy of L2-2 Cache
Figure 5.87 shows the total leakage savings in reference to the Baseline simulation for L3.
The savings for L3 ranges from 14.9% to 62%. This leakage savings is as expected and
matches previously presented results for L3.
Figure 5.87: Total Leakage Energy of L3 Cache
5.4.0.4.3 Total Energy
The total amount of energy savings for L1-1 is shown in Figure 5.88. This figure is very
similar to Figure 5.78 because the dynamic energy savings dominates the overall energy
savings. This is as expected and consistent with previous results.
Figure 5.88: Total Energy Savings of L1-1 Cache
The total amount of energy savings for L1-2 is shown in Figure 5.89. This figure is very
similar to Figure 5.79 because the dynamic energy savings dominates the overall energy
savings. This is as expected and consistent with previous results as well as the results
shown previously for L1-1.
Figure 5.89: Total Energy Savings of L1-2 Cache
The total amount of energy savings for L2-1 is shown in Figure 5.90. This figure is very
similar to Figure 5.85 because the leakage energy savings dominates the overall energy
savings for L2. This is also expected and consistent with previous results.
Figure 5.90: Total Energy Savings of L2-1 Cache
The total amount of energy savings for L2-2 is shown in Figure 5.91. This figure is very
similar to Figure 5.86 because the leakage energy savings dominates the overall energy
savings for L2. This is also expected and consistent with previous results as well as the
results shown previously for L2-1.
Figure 5.91: Total Energy Savings of L2-2 Cache
The total amount of energy savings for L3 is shown in Figure 5.92. This result combines the
results in Figure 5.82 and Figure 5.87 because the total energy savings for L3 is shared
between dynamic and leakage energy. This is also expected and consistent with previous
results.
Figure 5.92: Total Energy Savings of L3 Cache
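The statement that one component "dominates" the total savings can be made precise with a short derivation. The symbols below are introduced here for illustration and do not appear elsewhere in this work: $E_d$ and $E_l$ are the baseline dynamic and leakage energies of a cache, and $s_d$ and $s_l$ are the fractional dynamic and leakage savings.

```latex
\begin{align}
  s_{\text{total}}
    &= \frac{s_d E_d + s_l E_l}{E_d + E_l} \\
    &= w\, s_d + (1 - w)\, s_l,
    \qquad w = \frac{E_d}{E_d + E_l}
\end{align}
```

When $w$ is close to 1 the total savings follows the dynamic savings, which is the pattern described for L1. When $w$ is close to 0 it follows the leakage savings, as described for L2. For intermediate $w$ the total is a blend of both, as described for L3.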
Overall the multicore results show patterns similar to those in the multithreaded results.
There is some slight variation between the cores. This can be attributed to the fact that
each core is not necessarily running the same benchmarks in each of these simulations, and
also to the shared L3. Since L3 is shared between both cores' L2 caches, it is possible to get
varying behavior from L2 that then propagates up to L1. Overall the results for each core
are very similar, and they support the other results for both the multithreaded and single
threaded cases.
5.5 Summary
These results show that the Drowsy simulations result in both static and dynamic energy
savings. In these cases the first configuration, C0, is generally the optimal configuration.
This allows for high energy savings with a very low performance impact. Due to locality,
it is possible to obtain higher performance from this cache design in some cases. Overall
these results validate the drowsy phase adaptive cache design for all three levels and for
both multithreaded and multicore applications.
Chapter 6
Conclusions
This thesis presented a drowsy phase adaptive cache design that saves both static and
dynamic power while having a low impact on performance. By exploiting the temporal
locality of memory accesses and using a cache hierarchy with two partitions, these goals
were met for cache levels one, two, and three. The design is also shown to be successful
in multithreaded and multicore applications. In most cases, energy savings occur in the
Drowsy simulations, which generally spend most of their time in the C0 configuration.
This is the configuration with the smallest A partition and largest B partition. Since
these simulations have the highest energy savings while incurring a low and acceptable
performance hit, it can be concluded that for many applications this first configuration
is optimal for the Drowsy simulations.
There are many opportunities to expand on this work and investigate this design further.
First, since this work focuses on only one type of multicore architecture, it would be useful
to explore the effect that other multicore systems have on the overall performance. Second,
the voltage of the Drowsy partition could be lowered further. A lower voltage may yield
even higher leakage savings; however, it may introduce other problems, such as higher
latency and data loss.
Bibliography
[1] A. Agarwal, Hai Li, and K. Roy. Drg-cache: a data retention gated-ground cache
for low power. In Design Automation Conference, 2002. Proceedings. 39th, pages
473–478, 2002.
[2] Jorge Albericio Latorre, Pablo Ibez Marin, and Jose Maria Llaberia Grino. Improving
the SLLC Efficiency by exploiting reuse locality and adjusting prefetch. PhD thesis,
Zaragoza, Universidad de Zaragoza, Zaragoza, Ago 2013. Presentado: 20 05 2013.
[3] D.H. Albonesi. Selective cache ways: on-demand cache resource allocation. In Mi-
croarchitecture, 1999. MICRO-32. Proceedings. 32nd Annual International Sympo-
sium on, pages 248–259, 1999.
[4] Dr. Gene M. Amhahl. Validity of the single processor approach to achieving large
scale computing capabilities. In AFIPS Conference Proceedings, pages 483–485,
April 1967.
[5] R. Balasubramonian, D. Albones, A. Buyuktosunoglu, and S. Dwarkadas. Mem-
ory hierarchy reconfiguration for energy and performance in general-purpose proces-
sor architectures. In Microarchitecture, 2000. MICRO-33. Proceedings. 33rd Annual
IEEE/ACM International Symposium on, pages 245–257, 2000.
[6] Standard Performance Evaluation Corporation. Spec cpu2006 benchmark suite, June
2008.
[7] S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, D.H. Albonesi, S. Dwarkadas,
G. Semeraro, G. Magklis, and M.L. Scottt. Integrating adaptive on-chip storage struc-
tures for reduced dynamic power. In Parallel Architectures and Compilation Tech-
niques, 2002. Proceedings. 2002 International Conference on, pages 141–152, 2002.
[8] S. Dropsho, G. Semeraro, D.H. Albonesi, G. Magklis, and M.L. Scott. Dynami-
cally trading frequency for complexity in a gals microprocessor. In Microarchitecture,
2004. MICRO-37 2004. 37th International Symposium on, pages 157–168, Dec 2004.
118
[9] M. Farahani, F. Eslami, and A. Baniasadi. Application specific low leakage data
cache for embedded processors. In Green Computing Conference (IGCC), 2013 In-
ternational, pages 1–6, June 2013.
[10] Brendan Fitzgerald. Drowsy cache partitioning for reduced static and dynamic energy
in the cache hierarchy, 2012. Copyright - Copyright ProQuest, UMI Dissertations
Publishing 2012; Last updated - 2014-01-19; First page - n/a; M3: M.S.
[11] K. Flautner, Nam Sung Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches:
simple techniques for reducing leakage power. In Computer Architecture, 2002. Pro-
ceedings. 29th Annual International Symposium on, pages 148–157, 2002.
[12] Y.E. Jiongyao and T. Watanabe. An adaptive width data cache for low power design.
In SoC Design Conference (ISOCC), 2009 International, pages 488–491, Nov 2009.
[13] G. Keramidas, C. Datsios, and S. Kaxiras. A framework for efficient cache resizing.
In Embedded Computer Systems (SAMOS), 2012 International Conference on, pages
76–85, July 2012.
[14] Nam Sung Kim, K. Flautner, D. Blaauw, and T. Mudge. Circuit and microarchitectural
techniques for reducing cache leakage power. Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, 12(2):167–184, Feb 2004.
[15] J. Nemeth, Rui Min, Wen-Ben Jone, and Yiming Hu. Location cache design and
performance analysis for chip multiprocessors. Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, 19(1):104–117, Jan 2011.
[16] Salvador Petit, Julio Sahuquillo, Jose M. Such, and David Kaeli. Exploiting tem-
poral locality in drowsy cache policies. In Proceedings of the 2Nd Conference on
Computing Frontiers, CF ’05, pages 371–377, New York, NY, USA, 2005. ACM.
[17] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-vdd: a
circuit technique to reduce leakage in deep-submicron cache memories. In Low Power
Electronics and Design, 2000. ISLPED ’00. Proceedings of the 2000 International
Symposium on, pages 90–95, 2000.
[18] S. Srikantaiah, E. Kultursay, Tao Zhang, M. Kandemir, M.J. Irwin, and Yuan Xie.
Morphcache: A reconfigurable adaptive multi-level cache hierarchy. In High Perfor-
mance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on,
pages 231–242, Feb 2011.
119
[19] K.T. Sundararajan, T.M. Jones, and N. Topham. Smart cache: A self adaptive cache
architecture for energy efficiency. In Embedded Computer Systems (SAMOS), 2011
International Conference on, pages 41–50, July 2011.
[20] R. Ubal, J. Sahuquillo, S. Petit, and P. Lopez. Multi2sim: A simulation framework
to evaluate multicore-multithreaded processors. In Computer Architecture and High
Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on,
pages 62–68, Oct 2007.
[21] Y. Ye, S. Borkar, and V. De. A new technique for standby leakage reduction in high-
performance circuits. In VLSI Circuits, 1998. Digest of Technical Papers. 1998 Sym-
posium on, pages 40–41, June 1998.
Appendix A
Multithreaded Results
A.1 L1 Multithreaded Results
The following set of results is for a multithreaded cache system with just an L1 phase
adaptive cache.
A.1.1 Performance
A.1.1.1 Configuration Time
Figure A.1 shows the percentage of time spent in each possible configuration. For the Phase
and PhaseED configurations there is time spent in all of the configurations. The PhaseED
configuration has more activity in C2 and C3 than the Phase configuration in some cases.
This is due to the added consideration of the delay cost in the overall cost equation. For
the Drowsy and DrowsyED configurations, most of the configuration time is spent in the first
two configurations.
(a) Phase Config Time (b) PhaseED Config Time
(c) Drowsy Config Time (d) DrowsyED Config Time
Figure A.1: L1 Config Distributions
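The effect of adding a delay term to the cost can be illustrated with a small sketch. The cost equation used in this work is not reproduced in this section, so the code below substitutes a plain energy-delay product as a stand-in assumption. The class, the function, and the numeric estimates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IntervalEstimate:
    """Hypothetical per-interval estimate for one cache configuration."""
    name: str      # configuration label, e.g. "C0"
    energy: float  # estimated energy for the interval (arbitrary units)
    delay: float   # estimated delay for the interval (arbitrary units)


def pick_config(estimates: Iterable[IntervalEstimate], include_delay: bool) -> str:
    """Return the name of the configuration with the lowest cost.

    Assumption: an energy-only cost stands in for the Phase/Drowsy policies,
    and an energy-delay product stands in for the ED variants.
    """
    def cost(est: IntervalEstimate) -> float:
        return est.energy * est.delay if include_delay else est.energy

    return min(estimates, key=cost).name


# Illustrative values: the small configuration is cheaper in energy but slower.
example = [
    IntervalEstimate("C0", energy=1.0, delay=1.3),
    IntervalEstimate("C2", energy=1.2, delay=1.0),
]
```

With these illustrative numbers the energy-only cost selects C0, while the energy-delay cost selects C2. This is the kind of shift toward larger configurations that is described above for the ED variants.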
A.1.1.2 Speedup and IPC
Figure A.2 shows the Speedup for each of the benchmarks run with a phase adaptive L1
cache. It can be seen from these results that in all cases the performance is better than the
baseline. The performance ranges from 102% to 108%. In general the Phase and PhaseED
simulations have higher speedup than the Drowsy and DrowsyED simulations. This is as
expected due to the added performance hit incurred when the cache is set to drowsy.
Figure A.2: Speedup for L1 Simulations
Figure A.3 shows the MRU counter distribution for all of the simulations. These figures
show that a large majority of accesses are to MRU 0. This MRU distribution further
explains why there is a performance benefit for all of the simulations: the majority of
accesses are to MRU 0 and are therefore to the A partition.
(a) Phase MRU Distribution (b) PhaseED MRU Distribution
(c) Drowsy MRU Distribution (d) DrowsyED MRU Distribution
Figure A.3: L1 MRU Distributions
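The link between an MRU distribution and partition hits can be sketched as follows. The code assumes that the A partition holds the most recently used positions, starting at MRU 0, and that every remaining position falls in the B partition. The function name, the `a_ways` parameter, and the sample counts are illustrative only.

```python
from typing import Sequence, Tuple


def partition_hit_fractions(mru_hits: Sequence[int], a_ways: int) -> Tuple[float, float]:
    """Split hits between the A and B partitions from an MRU histogram.

    mru_hits[i] is the number of hits at MRU position i (0 = most recent).
    Assumption: positions 0 .. a_ways-1 map to the A partition, the rest to B.
    """
    if not 0 <= a_ways <= len(mru_hits):
        raise ValueError("a_ways must be between 0 and the number of MRU positions")
    total = sum(mru_hits)
    if total == 0:
        return 0.0, 0.0
    a_fraction = sum(mru_hits[:a_ways]) / total
    return a_fraction, 1.0 - a_fraction
```

For a hypothetical histogram of `[90, 5, 3, 2]` with a single-way A partition, about 90% of hits land in the A partition. A flatter histogram sends a larger share of hits to the B partition.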
A.1.2 Energy and Power
A.1.2.1 Dynamic Energy
The total dynamic energy savings is shown in Figure A.4. These results show that there is
dynamic energy savings for all of the simulations. The savings ranges from 11.3% to 18.1%
for the simulations with two threads. It ranges from 62% to 65% for the simulations with
four threads. In most cases the Drowsy and DrowsyED cases have slightly higher dynamic
energy savings. This is because of the slightly smaller energy values. These results are as
expected for L1.
Figure A.4: Total Dynamic Energy Savings of L1 Simulations
A.1.2.2 Leakage Energy
Figure A.5 shows the total leakage savings. There is leakage savings in both the Drowsy and
DrowsyED cases. This savings is very small, ranging from 1.8% to 9.2% for the simulations
with two threads. The savings is much higher for the simulation with four threads, around
56% on average. The leakage savings is small because of the small size of the L1 cache.
The energy values for the different configurations do not vary heavily as shown in Table 4.8.
This means that even though leakage energy is saved with each access to the A partition,
only a small overall percentage is saved. This amount increases with the number of cycles
and accesses, which is why the simulations with four threads show higher savings.
Figure A.5: Total Leakage Energy of L1 Simulations
A.1.2.3 Total Energy
The total amount of energy savings is shown in Figure A.6. This figure is very similar to
Figure A.4 because the dynamic energy savings dominates the overall energy savings. This
is not surprising due to the small amount of difference in the leakage energy values and
dynamic energy values for L1.
Figure A.6: Total Energy Savings of L1 Simulations
A.2 L2 Multithreaded Results
The following set of results is for a multithreaded cache system with just an L2 phase
adaptive cache.
A.2.1 Performance
A.2.1.1 Configuration Time
Figure A.7 shows the percentage of time spent in each possible configuration. For the
Phase and PhaseED configurations there is a lot of time spent in the C2 configuration. The
PhaseED configuration has more activity in C3 than the Phase configuration in some cases.
This is due to the added consideration of the delay cost in the overall cost equation. For the
Drowsy configuration most of the configuration time is spent in the first configuration. The
DrowsyED configuration has more activity in configurations C1 and C2 than the Drowsy.
This is due to added delay cost in the DrowsyED configuration.
(a) Phase Config Time (b) PhaseED Config Time
(c) Drowsy Config Time (d) DrowsyED Config Time
Figure A.7: L2 Config Distributions
A.2.1.2 Speedup and IPC
Figure A.8 shows the speedup for each of the benchmarks run with a drowsy phase adaptive
L2 cache. For L2 there is a performance hit in most cases; however, this impact is small. At
best the performance is roughly the same as the baseline: the speedup ranges from 97.1% in
the worst case to 99.5% in the best case.
Figure A.8: Speedup of L2 Simulations
Figure A.9 shows the MRU distribution for all of the simulations. These MRU distributions
explain why there is more of a performance hit for L2 than for L1. The hits are more
distributed across all of the MRU states rather than just MRU 0. This means that there is a
higher possibility of hits in the B partition.
(a) Phase MRU Distribution (b) PhaseED MRU Distribution
(c) Drowsy MRU Distribution (d) DrowsyED MRU Distribution
Figure A.9: L2 MRU Distributions
A.2.2 Energy and Power
A.2.2.1 Dynamic Energy
The total dynamic energy savings is shown in Figure A.10. It can be seen from this
figure that the simulations with two threads have small energy savings while the simulation
with four threads has high energy savings. The energy savings for the simulations with
two threads ranges from 1.3% to 17.3%, while the simulation with four threads has savings
of around 46.2%. In general the Drowsy and DrowsyED simulations have the most energy
savings, as expected.
Figure A.10: Total Dynamic Energy Savings of L2 Simulations
A.2.2.2 Leakage Energy
Figure A.11 shows the total leakage savings in reference to the Baseline simulation. There
is a promising amount of leakage savings shown for the Drowsy and DrowsyED cases, with
savings ranging from 44.5% to 45.3% for the cases with two threads. For the case with four
threads there is a savings of around 72.9%. These results are as expected.
Figure A.11: Total Leakage Energy of L2 Simulations
A.2.2.3 Total Energy
The total amount of energy savings is shown in Figure A.12. This figure is very similar
to Figure A.11 because the leakage energy savings dominates the overall energy savings.
This is as expected since the leakage energy makes up a large portion of the overall energy
consumption.
Figure A.12: Total Energy Savings of L2 Simulations
A.3 L3 Multithreaded Results
The following set of results is for a multithreaded cache system with just an L3 phase
adaptive cache.
A.3.1 Performance
A.3.1.1 Configuration Time
Figure A.13 shows the percentage of time spent in each possible configuration. For all of
the simulations there is a lot of configuration activity. The Drowsy simulations show the
most time spent in C0 as expected. The DrowsyED simulations show more variation than
the Drowsy simulations due to the addition of delay to the optimization equation. This is
as expected for L3 and matches results shown previously for the single threaded cases.
(a) Phase Config Time (b) PhaseED Config Time
(c) Drowsy Config Time (d) DrowsyED Config Time
Figure A.13: L3 Config Distributions
A.3.1.2 Speedup and IPC
Figure A.14 shows the Speedup for each of the benchmarks run with a phase adaptive L3
cache. It can be seen from these results that, for L3, the performance improves for some
cases and for others there is a low performance hit. In the best case the performance is
101%, while in the worst case the performance is 99.2%. This is as expected for L3 and
similar to the single threaded L3 case.
Figure A.14: Speedup of L3 Simulations
Figure A.15 shows the MRU distribution for all of the simulations. These MRU results
further explain the performance. Although the majority of accesses are to MRU 0, there is a
high percentage of accesses in some of the other MRU states. This restricts the overall
performance slightly, resulting in the performance shown.
(a) Phase MRU Distribution (b) PhaseED MRU Distribution
(c) Drowsy MRU Distribution (d) DrowsyED MRU Distribution
Figure A.15: L3 MRU Distributions
A.3.2 Energy and Power
A.3.2.1 Dynamic Energy
The total dynamic energy savings is shown in Figure A.16. These results show a small
amount of dynamic energy savings for L3 in the cases with only two threads. The Drowsy
and DrowsyED cases for the simulation with four threads have a high energy savings of
around 54%. In only two cases is the energy savings negative, ranging from -10% to -2.9%.
This is possibly due to the fact that the configurations only update every 15k instructions,
which allows for the possibility of a small negative energy savings. The positive energy
savings for the simulations with two threads ranges from 2.5% to 33.1%. This is as expected
and matches the results shown previously for L3.
Figure A.16: Total Dynamic Energy Savings of L3 Simulations
A.3.2.2 Leakage Energy
Figure A.17 shows the total leakage savings in reference to the Baseline simulation. The
savings for L3 ranges from 8.8% to 16.3% for the simulations with two threads. The sav-
ings is around 58% for the cases with four threads. These energy savings can be explained
by the drowsy leakage energy values. Since L3 is so large and the B partition is so large the
difference between the Drowsy leakage values and the Phase leakage values is relatively
small. The savings is larger for the four threaded cases because there are more accesses
and cycles providing more overall savings.
Figure A.17: Total Leakage Energy of L3 Simulations
A.3.2.3 Total Energy
The total amount of energy savings is shown in Figure A.18. This figure is similar to
Figure A.17 because the leakage energy savings dominates the overall energy savings. This
is as expected since the leakage energy makes up a large portion of the overall energy
consumption. Overall the results presented here are as expected and show patterns similar
to the results shown for single threaded simulations using L3.
Figure A.18: Total Energy Savings of L3 Simulations
