Washington University in St. Louis

Washington University Open Scholarship
McKelvey School of Engineering Theses &
Dissertations

McKelvey School of Engineering

Spring 5-17-2017

Cache Power Optimization Using Multiple Voltage Supplies to
Exploit Read/Write Asymmetry
Dengxue Yan
Washington University in St Louis


Recommended Citation
Yan, Dengxue, "Cache Power Optimization Using Multiple Voltage Supplies to Exploit Read/Write
Asymmetry" (2017). McKelvey School of Engineering Theses & Dissertations. 233.
https://openscholarship.wustl.edu/eng_etds/233

This Thesis is brought to you for free and open access by the McKelvey School of Engineering at Washington
University Open Scholarship. It has been accepted for inclusion in McKelvey School of Engineering Theses &
Dissertations by an authorized administrator of Washington University Open Scholarship. For more information,
please contact digital@wumail.wustl.edu.

WASHINGTON UNIVERSITY IN ST. LOUIS
School of Engineering and Applied Science
Department of Electrical Engineering

Thesis Examination Committee:
Xuan Zhang, Chair
Roger Chamberlain
Shantanu Chakrabartty

Cache Power Optimization Using Multiple Voltage Supplies
to Exploit Read/Write Asymmetry
by
Dengxue Yan

A thesis presented to
School of Engineering and Applied Science
of Washington University in St. Louis in partial fulfillment of the
requirements for the degree of
Master of Science

May 2017
St. Louis, Missouri

© 2017, Dengxue Yan

Table of Contents
List of Figures
List of Tables
Acknowledgments
ABSTRACT OF THE THESIS
Chapter 1: Introduction
  1.1 Background
  1.2 Cache Power Ratio
  1.3 Cache Structure
Chapter 2: Proposed Framework
  2.1 Frequency vs Supply Voltage
  2.2 Power Consumption vs Supply Voltage
  2.3 Fast Power Switch
  2.4 Finite State Machine of Write-Back Cache
Chapter 3: System Configuration
  3.1 SRAM Cell Simulation
  3.2 Cacti Simulation
  3.3 Static Noise Margin
  3.4 Inverter Ring Simulation
  3.5 Power Switches Implementation
    3.5.1 Cache Operation Finite State Machine
    3.5.2 Power Switch Logic
    3.5.3 Architecture
    3.5.3 VCS Simulation
  3.6 Power Switches Circuit Design
  3.7 Gem5 Simulation
Chapter 4: Result Evaluation
  4.1 Dynamic/Leakage Power Ratio
  4.2 Power Savings Ratio per Cache
  4.3 Power Savings Ratio to CPU
  4.5 Power Savings of Benchmarks by Gem5
Chapter 5: Future Work
Chapter 6: Conclusions
References
Appendix A
Appendix B
  B.1 Gem5 Compiling
  B.2 Run Benchmarks on Gem5 [24]


List of Figures
Figure 1.1 Devices using processor [1]
Figure 1.2 Battery-operated devices [2]
Figure 1.3 Die of common processor [3]
Figure 1.4 CPU power breakdown when run SPEC benchmark on Gem5
Figure 1.5 ARM946E-S 8KB, 4 set-associative, 32 bytes per line cache structure [7]
Figure 2.1 Max frequency vs power supply
Figure 2.2 Power consumption vs power supply
Figure 2.3 Fast power switches
Figure 2.4 Behaviors of Fast power switches
Figure 2.5 FSM of write-back cache [9]
Figure 3.1 SRAM cell circuit with drivers
Figure 3.2 6T SRAM cell
Figure 3.3 Simplified 6T SRAM cell
Figure 3.4 Power consumption vs voltage supply of TAG RAM (Cacti)
Figure 3.5 Power consumption vs voltage supply of DATA RAM (Cacti)
Figure 3.6 Circuit to measure the SRAM hold Static Noise Margin
Figure 3.7 Circuit to measure the SRAM read Static Noise Margin
Figure 3.8 Circuit to measure the SRAM write Static Noise Margin (Write 0)
Figure 3.9 Circuit to measure the SRAM write Static Noise Margin (Write 1)
Figure 3.10 Hold noise margin
Figure 3.11 Read noise margin
Figure 3.12 Write noise margin
Figure 3.13 Static noise margin vs power supply
Figure 3.14 Circuit for inverter ring simulation
Figure 3.15 Power reduction ratio when voltage decreases from 1.2 V to 0.9 V
Figure 3.16 Power reduction ratio when voltage decreases from 1.2 V to 0.4 V
Figure 3.17 Power switch logic of TAG RAM
Figure 3.18 Power switch logic of DATA RAM
Figure 3.19 Whole power switch control functional block
Figure 3.20 Data-path of 64K two-way associative cache including power switches
Figure 3.21 System diagram after integrating power switch circuit
Figure 3.22 Average power duration ratio for dynamic power of SPEC benchmark
Figure 3.23 Average power duration ratio for leakage power of SPEC benchmark
Figure 4.1 Dynamic/Leakage power breakdown
Figure 4.2 Power breakdown to TAG and DATA RAM
Figure 4.3 Power breakdown of benchmark by Gem5
Figure A.1 Circuit to measure leakage current of nMOS on
Figure A.2 Circuit to measure leakage current of nMOS off
Figure A.3 Circuit to measure leakage current pMOS on
Figure A.4 Polyfit function of nMOS on
Figure A.5 Polyfit function of nMOS off
Figure A.6 Polyfit function of pMOS on
Figure A.7 Polyfit vs original curve of nMOS on
Figure A.8 Polyfit vs original curve of nMOS off
Figure A.9 Polyfit vs original curve of pMOS on


List of Tables
Table 1.1 Cache power ratio to the CPU power
Table 3.1 Voltages selection
Table 3.2 Power saving ratio at different voltage to V1
Table 4.1 Dynamic/Leakage power breakdown
Table 4.2 TAG RAM ratio of 4 way-associative 32 bytes per line cache
Table 4.3 TAG RAM ratio of 4 way-associative 64 bytes per line cache
Table 4.4 Power breakdown to TAG and DATA RAM
Table 4.5 Power saving ratio of every cache
Table 4.6 Power saving ratio to the CPU


Acknowledgments
First, I would like to thank my thesis advisor, Professor Xuan Zhang of Electrical
Engineering at Washington University in St. Louis. She consistently steered my
research and this thesis in the right direction whenever she thought I needed it.
I would also like to thank my thesis committee members, Professor Roger Chamberlain
and Professor Shantanu Chakrabartty. Without their committed participation and
feedback, my thesis defense could not have been successfully conducted.
Thirdly, I would like to thank all the members of Professor Xuan 'Silvia' Zhang’s
lab for reviewing and proofreading my thesis; I am gratefully indebted to them for their
very valuable comments.
Finally, I must express my profound gratitude to my parents and godparents for
providing me with unfailing support and continuous encouragement throughout my years
of study. The completion of my master's studies would not have been possible without
them.
Dengxue Yan
Washington University in St. Louis
May 2017


ABSTRACT OF THE THESIS
Cache Power Optimization Using Multiple Voltage Supplies
to Exploit Read/Write Asymmetry
by
Dengxue Yan
Master of Science in Electrical Engineering
Washington University in St. Louis, 2017
Research Advisor: Professor Xuan Zhang

Power consumption has become increasingly critical in modern computer systems.
Most previous work has focused on the general-purpose computational core, but
optimization techniques for the conventional CPU core have reached a limit. Our experimental
results show that read operations in SRAM can be performed at a lower supply voltage, with
much lower power consumption, than write operations. Based on this
observation and the fact that cache, consisting mostly of SRAM, often occupies a
significant share of the on-chip area of the CPU and consumes a large portion of the CPU power, we
propose a new method to reduce the power consumption of cache. By dynamically
switching the cache voltage supply between a lower voltage for read and a higher voltage
for write, our method can effectively reduce cache power without affecting the
performance of the multi-level cache hierarchy in a computer system. We can realize
further power savings by lowering the supply below the read voltage for hold-only operation
when the cache is idle. Both the power switching controller implementation and the
power consumption statistics from various SPEC benchmarks are presented to
demonstrate the efficiency of our proposed method.


Chapter 1: Introduction
Moore’s law has guided the development of the semiconductor industry for decades.
However, recent studies have shown that computers are approaching the physical
limits of Moore's law. Among these physical limitations, power is the most important.
Therefore, this thesis focuses on a power-saving methodology for processors. In
this chapter, we describe the background and the basic knowledge used
throughout this thesis.

1.1 Background
Figure 1.1 shows that processors are ubiquitous. They are embedded in almost all smart devices,
such as cars, wearable devices, and wireless sensors. As technology scales and
processor speeds improve, power has become a first-order design constraint in all aspects
of processor design.
In addition, mobile and battery-operated devices such as wearable devices, wireless
sensors, and battery-less IoT (Internet-of-Things) devices are hot technology topics (Figure 1.2). In
these devices, the power consumption of the processor plays a very important role in
battery life.

Figure 1.2 Battery-operated devices [2]

Figure 1.1 Devices using processor [1]


Therefore, reducing the power consumption of processors is very important, and many
researchers have worked on it. However, most of that research focuses on the core
of the processor, where power optimization techniques have
reached their limit.

1.2 Cache Power Ratio
Commonly, a significant portion of the processor die is occupied by on-chip caches
(Figure 1.3). As a result, cache consumes a large share of the processor power.

Figure 1.3 Die of common processor [3]

Table 1.1 Cache power ratio to the CPU power

CPU          Cache Type   Cache Size   Power   Ref.
Niagara-1    L2 Cache     256 KB       22%     [4]
Niagara-2    L2 Cache     256 KB       24%     [4]
Alpha21264   L1 Cache     64 KB        16%     [5]
StrongARM    L1 Cache     2 x 16 KB    30%     [5]

In Table 1.1, we survey the ratio of cache power to total processor power
published in previous work [4,5]. The L1 cache consumes about
16%~30% of the total processor power, and the L2 cache consumes more than 20% of
the total processor power. In addition, the paper “Dynamic Zero Compression for Cache
Energy Reduction” [6] reports that about 30%~60% of processor power is consumed by
the caches in total (including L1, L2, and L3 caches). This ratio matches the result of our SPEC
benchmark simulation on Gem5, which is marked by the red dashed rectangle
in Figure 1.4. Therefore, to reduce processor power, it is essential to reduce cache
power.

Figure 1.4 CPU power breakdown when run SPEC benchmark on Gem5

Figure 1.5 ARM946E-S 8KB, 4 set-associative, 32 bytes per line cache structure [7]


1.3 Cache Structure
Figure 1.5 shows the structure of an 8 KB, 4-way set-associative cache with 32 bytes per line. The
RAM of the cache is divided into two parts: TAG RAM and DATA RAM. The DATA
RAM is used for data storage, while the TAG RAM stores the associated address
information. For the above 8 KB cache, the size of the TAG RAM is 21 × 2^6 × 2^2 bits. As
the cache size increases or the number of sets decreases, the size of the TAG RAM
increases. In addition, the TAG RAM is always accessed in parallel, while only the
index-matched cache lines are activated during a read/write operation, which means the TAG
RAM consumes a large portion of the dynamic power. Therefore, the large amount of TAG
RAM cannot be ignored when saving power. There are other control bits, such as valid and
protect bits, which can be associated with either the TAG RAM or the DATA RAM.
In the following chapters, the TAG RAM and DATA RAM are
discussed separately because of their different characteristics.
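
The TAG RAM size quoted above can be checked with a short calculation. The sketch below is only an illustration: it assumes a 32-bit physical address and counts tag bits only (no valid, dirty, or protect bits). The function name `tag_ram_bits` is ours and does not come from any tool used in this thesis.

```python
def tag_ram_bits(cache_bytes, ways, line_bytes, addr_bits=32):
    """Return (tag_bits, num_sets, total_tag_bits) for a set-associative cache.

    Assumes power-of-two sizes and a 32-bit address by default.
    Control bits (valid, dirty, protect) are not counted.
    """
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = num_sets.bit_length() - 1      # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, num_sets, tag_bits * num_sets * ways


# 8 KB, 4-way, 32 bytes per line: 21-bit tags, 2^6 sets, 2^2 ways.
print(tag_ram_bits(8 * 1024, ways=4, line_bytes=32))  # (21, 64, 5376)
```

With the cache size held fixed, doubling the associativity to 8 ways halves the number of sets and widens each tag by one bit, so the total TAG RAM grows, in line with the trend described above.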


Chapter 2: Proposed Framework
In Chapter 1, we discussed that cache consumes a significant portion of processor
power, so reducing the power consumption of cache is essential to reducing CPU power. In
this chapter, we introduce the framework of our new method, which reduces the power
consumption of cache by switching the power supplies of different parts of the cache to lower
voltages according to the cache operation being performed.

2.1 Frequency vs Supply Voltage
Our research shows that, at the same supply voltage, the maximum read frequency of SRAM is
higher than the maximum write frequency, as shown by the vertical dashed line in Figure 2.1. In other
words, at the same maximum frequency, the voltage required for an SRAM read is lower than that
for a write (the horizontal dashed line in Figure 2.1).

Figure 2.1 Max frequency vs power supply

A traditional SRAM, which is powered by a single supply, uses the write
frequency at the rated voltage as the reference for both read and write operations,
because the read operation is also safe at this frequency and voltage. As a result, the read
operation works at a higher voltage than it needs. Figure 2.1 shows that, at the same
frequency, the supply voltage for the read operation can be lower than the write
voltage, which indicates that the read operation of SRAM
can work at a lower voltage than the write operation without affecting performance.

2.2 Power Consumption vs Supply Voltage
In Section 2.1, we discussed that the voltage for an SRAM read can be lower than the rated
voltage. But what is the benefit of lowering the read voltage? Figure 2.2 answers this
question. It shows the relationship between the power consumption of an inverter
ring and the supply voltage. An inverter ring is often used to represent digital logic in simulation.
As the supply voltage decreases, the dynamic power of SRAM decreases quadratically and
the leakage power decreases exponentially. This indicates that if we switch the
power supply to a lower voltage for SRAM read operations, a significant amount of power can be
saved.
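
The scaling trends above can be illustrated with a first-order model. The sketch below assumes the textbook relation that dynamic power is proportional to C·V²·f at fixed capacitance and frequency. It also uses a toy exponential leakage model with a hypothetical fitting parameter `v_scale`. Neither function is the model behind Figure 2.2, and the value 0.3 V used for `v_scale` is a placeholder, not a measured quantity.

```python
import math


def dynamic_power_ratio(v_new, v_ref):
    """Dynamic power ~ C * V^2 * f, so at fixed C and f the ratio is (V_new / V_ref)^2."""
    return (v_new / v_ref) ** 2


def leakage_power_ratio(v_new, v_ref, v_scale):
    """Toy model: leakage ~ exp(V / v_scale); v_scale is a hypothetical fit parameter."""
    return math.exp((v_new - v_ref) / v_scale)


# Lowering the supply from 1.2 V to 0.9 V under the quadratic model.
saving = 1.0 - dynamic_power_ratio(0.9, 1.2)
print(f"dynamic power saving: {saving:.2%}")

# Placeholder v_scale = 0.3 V, for illustration only.
print(f"leakage ratio at 0.4 V: {leakage_power_ratio(0.4, 1.2, 0.3):.3f}")
```

Under the quadratic assumption alone, going from 1.2 V to 0.9 V leaves (0.9/1.2)² = 0.5625 of the dynamic power, a saving of roughly 44%.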

Figure 2.2 Power consumption vs power supply


Furthermore, the figure shows that, when the supply voltage drops below 0.4 V,
more than 90% of the leakage power of a digital circuit can be saved. In a later chapter, we
will show that SRAM can hold its value even when the voltage is as low as 0.4 V, and that the
lower-level caches (L2 and L3) are in the data hold stage most of the time.
Therefore, switching to a lower supply voltage during read operations and to a much lower
voltage during the data hold stage can greatly reduce cache power
consumption.

2.3 Fast Power Switch
Although lowering the voltage in different stages can save cache power,
a fast voltage switch is needed to shift the supply quickly from one voltage to another.
Fortunately, T. N. Miller proposed such a fast power switch in the paper “Booster:
Reactive Core Acceleration for Mitigating the Effects of Process Variation and
Application Imbalance in Low-Voltage Chips” [8]. Figure 2.3 shows the circuit of the
switch, and Figure 2.4 shows its transition behavior.

Transition time < 10ns

Figure 2.3 Fast power switches


Figure 2.4 Behaviors of Fast power switches

In the paper mentioned above, the power supply can be switched from one voltage to another
within 10 nanoseconds. Although this is not fast enough to switch the power
supply at the speed of the CPU core, it makes power supply switching feasible. During
long runs of consecutive reads and long idle stages of the cache, the operation time can be much
longer than the switching time. Therefore, switching the supply voltage according to the
cache operation is possible.
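
To put the 10 ns transition time in perspective, the sketch below converts it into clock cycles and into a fraction of a cache phase. The 3.2 GHz clock is the maximum frequency quoted later in Section 3.1. The phase lengths and the function names are hypothetical and serve only as an illustration.

```python
def cycles_per_switch(transition_ns, clock_ghz):
    """Number of clock cycles that elapse during one supply transition."""
    return transition_ns * clock_ghz


def switching_overhead(phase_cycles, transition_ns=10.0, clock_ghz=3.2):
    """Fraction of a cache phase (read burst or idle period) spent switching."""
    return cycles_per_switch(transition_ns, clock_ghz) / phase_cycles


# A 10 ns transition at 3.2 GHz costs 32 cycles.
print(cycles_per_switch(10.0, 3.2))

# Hypothetical phase lengths: switching pays off only for long phases.
for phase in (32, 320, 3200):
    print(phase, f"{switching_overhead(phase):.1%}")
```

A single access cannot hide a 32-cycle transition, but an idle period or read burst lasting thousands of cycles loses only about 1% of its time to switching.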
Another issue to consider is that the delay of the switches might affect the
performance of the cache. In this thesis, we leave the L1 cache DATA RAM always
at the rated voltage, which implies it always runs at the highest performance and highest power
consumption, and let the operating system handle the delay of power supply
switching for the L2 and L3 caches. In Section 3.5, we further discuss this problem when
we discuss the system implementation of our design. Throughout this thesis, we discuss
power supply switching for the TAG RAM of all cache levels and the DATA RAM of the L2 and L3
caches. The next question is when to switch the power supply.

2.4 Finite State Machine of Write-Back Cache
Although switching the power supply at every instruction cycle is unrealistic,
consecutive reads/writes and long idle stages exist during cache operation. A
careful study of the finite state machine (FSM) of a common write-back cache controller,
shown in Figure 2.5, shows that the lower-level cache is idle when a higher-level
cache hit occurs. Only when a cache miss occurs is the lower-level cache read.
Furthermore, if a cache miss occurs and the cache line is dirty, the write
operation of the lower-level cache is activated. Since the write/read operation of a cache works on
a cache line, which is a block of data, it provides the consecutive write/read operation
discussed in Section 2.3. In addition, from our SPEC benchmark simulation on Gem5, we
found that the cache hit ratio is greater than 50% for the L1 cache and greater than 90% for
the L2 and L3 caches. This means the L2 cache is idle for more than 50% of the time and
the L3 cache is idle for more than 90% of the time. In the idle state, the caches only
hold their data without performing any operation, so more than 90% of the power can be saved if
the voltage is lowered to the lowest data hold voltage. Therefore, switching to a different
power supply according to the state of the cache saves a lot of power. Further discussion
is presented in Section 3.5 and Chapter 4.
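
The controller behavior described above can be sketched as a small state machine. This is a simplified illustration and not the controller implemented in this thesis. The state names follow Figure 2.5, but the transitions out of the evict, refill, and data states are our assumptions, and `lower_level_operation` is a hypothetical helper that maps each state to the operation seen by the next-lower cache level.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    TAG_CHECK = "tag_check"
    EVICT = "evict"
    REFILL = "refill"
    DATA_RW = "data_rw"


def next_state(state, request=False, valid=False, dirty=False, hit=False):
    """One step of a simplified write-back cache controller FSM."""
    if state is State.IDLE:
        return State.TAG_CHECK if request else State.IDLE
    if state is State.TAG_CHECK:
        if valid and hit:
            return State.DATA_RW
        if valid and dirty:       # valid && dirty && miss: write the line back first
            return State.EVICT
        return State.REFILL       # (!valid) || ((!dirty) && miss)
    if state is State.EVICT:
        return State.REFILL       # assumed: refill follows the write-back
    if state is State.REFILL:
        return State.DATA_RW      # assumed: serve the request after the refill
    return State.IDLE             # DATA_RW returns to IDLE


def lower_level_operation(state):
    """Operation that this state triggers in the next-lower cache level."""
    if state is State.EVICT:
        return "write"
    if state is State.REFILL:
        return "read"
    return "hold"


# A dirty miss: the lower level sees a write, then a read, and otherwise holds.
s = State.IDLE
s = next_state(s, request=True)
s = next_state(s, valid=True, dirty=True, hit=False)
print(s, lower_level_operation(s))
```

In this sketch the lower-level cache only needs the write or read supply while the upper-level controller is in the evict or refill state; in every other state it can stay at the hold voltage.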

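[Figure 2.5: state diagram of the write-back cache controller. States: IDLE (entered on reset), Tag Check, Evict Process, Refill Process, and Data Read or Write. A read or write request moves the controller from IDLE to Tag Check. From Tag Check, "valid && hit" leads to Data Read or Write; "valid && dirty && miss" leads to Evict Process, which activates the lower-layer memory write operation; "(!valid) || ((!dirty) && miss)" leads to Refill Process, which activates the lower-layer memory read operation.]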

Figure 2.5 FSM of write-back cache [9]


The facts discussed in Sections 2.1 to 2.4 lead us to a new method for reducing
the power consumption of cache: switch to a lower supply voltage during cache
read operations and to a much lower voltage during the cache idle stage, reducing cache
power consumption as much as possible.
The new method has been verified by our simulation results, which we
discuss in detail in Chapter 4. Before that, Chapter 3 describes the detailed configurations of the
simulations used to verify our proposal and to estimate the results.


Chapter 3: System Configuration
Having proposed the new method to reduce the power consumption of cache, we
discuss in this chapter the system configurations used to simulate and verify this idea and
to evaluate the corresponding results. The work we have done includes SRAM cell
simulation, Cacti simulation, static noise margin measurement, inverter ring simulation,
finite state machine analysis, power switch circuit design, and Gem5 simulation. We discuss
these experiments one by one in this chapter.

3.1 SRAM Cell Simulation
To find the relationship between read/write frequency and supply voltage, we built in
Cadence a 6T SRAM cell with drivers, shown in Figure 3.1, using a
standard 130 nm CMOS process.
The 6T SRAM cell is a commonly used SRAM cell structure. Figure 3.2 shows its detailed
circuit, and Figure 3.3 shows its simplified diagram. The cross-coupled inverters
store one bit, and the access transistors work as switches, controlled by the word line (WL), that
connect the SRAM cell to the bit lines. To maintain read and write stability, the pull-down transistors
(M1, M3) must be stronger than the access transistors (M5, M6), which in turn must be stronger than
the pull-up transistors (M2, M4) [10].
The pre-charge circuit pre-charges the complementary bit lines BL and BLB and makes sure they are at
the same voltage level before a read/write operation. The common pre-charge voltage is
Vdd or Vdd/2. In our simulation, we pre-charge BL and BLB to Vdd.
When a read operation is issued, the corresponding WL is activated and BL and BLB are driven
by the cross-coupled inverters. The sense amplifier then quickly amplifies the small differential
voltage between BL and BLB to Vdd, which accelerates the read
process. Figure 3.1 shows that the sense amplifier is essentially a pair of cross-coupled
inverters when it is activated.
The write-driver circuit drives the SRAM cell to the opposite state (1 to 0 or 0 to
1). Since the cross-coupled inverters hold their data strongly, inverting this state requires
strong drive on BL and BLB. This is the purpose of the write-driver
circuit. The write driver is also a pair of cross-coupled inverters, but its pMOS and
nMOS transistors are much wider than the ones in the SRAM cell. Therefore, it has strong
driving ability, which ensures the reliability of the SRAM cell write process.
Using the SRAM cell circuit in Figure 3.1 and sweeping the power
supply from 1.2 V to 0.4 V, we obtain the relationship between maximum frequency and
supply voltage shown in Figure 2.1. The figure shows that, at the same maximum
frequency, the voltage for read is lower than that for write. For example, at a maximum frequency of 3.2
GHz, the supply voltage for write is 1.2 V, but for read it is less than 0.9 V. Therefore,
without decreasing the read frequency, we can lower the supply voltage for SRAM read operations to 0.9 V,
which leads to a significant power saving.
Besides showing that the read voltage is lower than the write voltage, the simulation of the
SRAM cell also shows that the data hold voltage of SRAM can be lower than 0.4 V.
Therefore, it is safe for our simulations to select 0.9 V for the read operation and 0.4 V
for the idle state in the 130 nm CMOS technology with a rated voltage of 1.2 V, which we use
throughout our Cadence simulations. Table 3.1 shows the voltages we select for write,
read, and hold operations and the names we use for them in this thesis.

[Figure 3.1: schematic of the SRAM cell circuit with drivers, consisting of a pre-charge circuit, a 6T SRAM cell, a sense amplifier, and a write driver. Labeled signals: EN, WL, BL, Q, Ro, Din, Vdd, Gnd.]

Figure 3.1 SRAM cell circuit with drivers


[Figure 3.2: schematic of the 6T SRAM cell. M5, M6: access transistors; M2, M4: pull-up transistors; M1, M3: pull-down transistors; BL: bit line; WL: word line.]

Figure 3.2 6T SRAM cell


Figure 3.3 Simplified 6T SRAM cell

Table 3.1 Voltages selection

Operation Type   Supply Voltage (in our simulation)   Voltage Name
WRITE            1.2 V                                V1
READ             0.9 V                                V2
HOLD             0.4 V                                V3

3.2 Cacti Simulation
After the voltages are selected, we use the Cacti simulator [11] to systematically evaluate the
performance of the cache after switching the power supplies. Cacti is an integrated cache and
memory access time, cycle time, area, leakage, and dynamic power model. Since the MOSFET voltage
and current in the original version of Cacti are fixed, we measure the currents of nMOS and pMOS
transistors at different supply voltages in Cadence and then use the polynomial curve fitting function in
Matlab to build a polynomial equation that expresses the current as a function of the supply
voltage (Appendix A). We then substitute this equation for the fixed current
module in the original code. With this change, we can sweep the supply voltage to obtain the power
consumption of SRAM using Cacti; the results are shown in Figure 3.4 and Figure 3.5.
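
The curve-fitting step can be sketched in a few lines. The code below is a plain-Python least-squares polynomial fit that stands in for the Matlab fitting function. It is not the code used in this thesis, and the sample points are synthetic placeholders generated from a made-up quadratic, not the Cadence measurements of Appendix A. Coefficients are returned lowest order first.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest order first."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x (the voltage-dependent current model)."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Synthetic placeholder samples (NOT measured data), from a made-up quadratic.
voltages = [0.4 + 0.1 * k for k in range(9)]            # 0.4 V to 1.2 V
currents = [0.5 * v * v - 0.1 * v + 0.02 for v in voltages]
coeffs = polyfit(voltages, currents, degree=2)
print([round(c, 6) for c in coeffs])
```

A function like `polyval` would then take the place of the fixed current value, so that the current follows the supply voltage during a sweep.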
For TAG RAM (Figure 3.4):
1) As the voltage decreases from 1.2 V to 0.9 V:
• Dynamic power consumption falls by more than 40%
• Leakage power consumption falls by more than 60%
2) As the voltage decreases from 1.2 V to 0.4 V:
• Leakage power consumption falls by more than 90%
For DATA RAM (Figure 3.5):
1) As the voltage decreases from 1.2 V to 0.9 V:
• Dynamic power consumption falls by more than 30%
• Leakage power consumption falls by more than 50%
2) As the voltage decreases from 1.2 V to 0.4 V:
• Leakage power consumption falls by more than 90%
Based on Figures 3.4 and 3.5, we obtain the estimated power saving ratios listed in Table 3.2
for the power supplies selected in Table 3.1.
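
These ratios can be combined into an average saving once the share of time spent at each supply is known. The sketch below weights the leakage saving ratios listed above by such time shares. The shares in the example are hypothetical placeholders, not Gem5 results, and the function name is ours. Because the listed ratios are lower bounds ("more than"), the result is a conservative estimate.

```python
# Leakage saving ratios relative to V1 (1.2 V), from the estimates listed above.
LEAKAGE_SAVING = {
    "TAG": {"V1": 0.0, "V2": 0.60, "V3": 0.90},
    "DATA": {"V1": 0.0, "V2": 0.50, "V3": 0.90},
}


def average_leakage_saving(ram, time_share):
    """Time-weighted leakage saving for one RAM type.

    time_share maps a supply name (V1, V2, V3) to the fraction of time spent there.
    """
    if abs(sum(time_share.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    return sum(share * LEAKAGE_SAVING[ram][supply]
               for supply, share in time_share.items())


# Hypothetical shares: 5% writing (V1), 5% reading (V2), 90% holding (V3).
example_share = {"V1": 0.05, "V2": 0.05, "V3": 0.90}
print(f"DATA RAM: {average_leakage_saving('DATA', example_share):.1%}")
print(f"TAG RAM:  {average_leakage_saving('TAG', example_share):.1%}")
```

The same weighting applies to dynamic power, except that no dynamic saving is listed for V3 because a cache in the hold state performs no operation.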


Figure 3.4 Power consumption vs voltage supply of TAG RAM (Cacti)

Figure 3.5 Power consumption vs voltage supply of DATA RAM (Cacti)

Table 3.2 Power saving ratio at different voltage to V1

Power     TAG RAM               DATA RAM
Supply    Dynamic   Leakage     Dynamic   Leakage
V1        0%        0%          0%        0%
V2        40%       60%         30%       50%
V3        -         90%         -         90%


3.3 Static Noise Margin
Since the static noise margin (SNM) is an important measure of SRAM cell stability,
we design circuits to measure the SNMs of the SRAM cell after selecting the voltages for
write/read/hold and obtaining the power saving ratio for each power supply.
An SRAM cell has three types of SNM: hold SNM, read SNM, and write
SNM, which reflect the three phases of SRAM cell operation. The circuits used to measure
these noise margins are shown in Figure 3.6 to Figure 3.9. These are the traditional circuits
used to draw the well-known butterfly curves, which plot the voltage transfer characteristics
(VTC) of the cell’s feed-forward and feedback inverters on a single plot. To measure
the SNMs of the SRAM, we break the cross-coupled inverter loop, sweep
the voltage at the input from Gnd to Vdd, and measure the voltage at the output of
each inverter. Overlaying the two VTC curves gives the butterfly curves,
from which we measure the SNMs according to their definitions.
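
The butterfly-curve measurement can be sketched numerically. The code below assumes two identical inverters with an idealized tanh-shaped VTC, which is not a transistor-level model, and takes the SNM as the side of the largest square that fits inside one lobe of the butterfly curve. The `gain` parameter and both function names are hypothetical. This only mimics the hold case and is not the Cadence measurement used in this thesis.

```python
import math


def inverter_vtc(v_in, vdd, gain):
    """Idealized symmetric inverter VTC (tanh curve, not a transistor model)."""
    return 0.5 * vdd * (1.0 - math.tanh(gain * (v_in - 0.5 * vdd) / vdd))


def static_noise_margin(vdd, gain, steps=400):
    """Side of the largest square that fits in the upper-left butterfly lobe."""
    def f(v):
        return inverter_vtc(v, vdd, gain)

    best = 0.0
    for i in range(steps + 1):
        b = vdd * i / steps        # bottom edge of the candidate square
        a = f(b)                   # left edge sits on the mirrored VTC, x = f(y)
        if a >= b:                 # keep only corners above the diagonal y = x
            continue
        lo, hi = 0.0, vdd
        for _ in range(50):        # bisection on the side length s
            s = 0.5 * (lo + hi)
            if b + s <= f(a + s):  # top-right corner still under the VTC y = f(x)
                lo = s
            else:
                hi = s
        best = max(best, lo)
    return best


# With this idealized model the SNM scales with the supply voltage.
for vdd in (1.2, 0.9, 0.4):
    print(vdd, round(static_noise_margin(vdd, gain=20), 3))
```

Since both inverters are identical, the two lobes are mirror images and one lobe is enough. With mismatched inverters, the smaller of the two lobe squares would be taken.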
In the circuits shown in Figure 3.6 to Figure 3.9, the voltage source drawn in blue is the input of
the SRAM cell inverters after the cross-coupling is broken. To obtain the VTC curves, we
sweep its voltage from Gnd to Vdd, as mentioned above, and measure the
outputs at the Q and QB nodes. The parts marked in red in these circuits indicate the states of
BL, BLB, and WL during the SNM measurement. During the data hold stage, the
word line of the SRAM cell is de-activated, so we apply 0 (Gnd) to the word line, as shown in
Figure 3.6, when measuring the hold SNM. The access transistors (M5, M6) are therefore
off, which isolates the cross-coupled inverters from BL and BLB. Thus, the voltage level
on BL and BLB has little effect on the hold noise margin. During read and
write operations, however, the access transistors are on. As a result, we apply 1 (Vdd) to WL when
measuring the read and write noise margins, as shown in Figure 3.7 to Figure 3.9.
The read and write measurement circuits differ in how the bit lines are driven. During a read
operation, both BL and BLB are pre-charged to 1 (Vdd), so the read SNM circuit applies 1 (Vdd)
to both bit lines (Figure 3.7). During a write operation, the write driver drives one of
BL and BLB to 1 (Vdd) and the other to 0 (Gnd), so the write SNM circuits apply 1 (Vdd) to one
bit line and 0 (Gnd) to the other (Figure 3.8 and Figure 3.9). Since we use ideal symmetric inverters in the SRAM cell in
our simulation, the noise margins for reading/writing 0 and 1 are the same. In addition, since
there are several definitions of SRAM write noise margin, we use the one presented in
[12,13,14,15]. With this definition, the write SNM might be greater than the read SNM,
because the write SNM reflects the difficulty of flipping the cross-coupled inverters.
Figures 3.10 to 3.12 show the hold, read, and write static noise margins of the SRAM cell
measured at a supply voltage of 1.2 V.

Figure 3.6 Circuit to measure the SRAM hold Static Noise Margin

Figure 3.7 Circuit to measure the SRAM read Static Noise Margin

Figure 3.8 Circuit to measure the SRAM write Static Noise Margin (Write 0)

Figure 3.9 Circuit to measure the SRAM write Static Noise Margin (Write 1)

Figure 3.10 Hold noise margin

Figure 3.11 Read noise margin

Figure 3.12 Write noise margin

Using the circuits in Figure 3.6 to 3.9, we measure the static noise margins of the SRAM
cell at different supply voltages; the results are shown in Figure 3.13. The curves in
Figure 3.13 show that as the supply voltage decreases from 1.2 V to 0.9 V, the read noise
margin is nearly unchanged (it drops by less than 0.04 V) compared to the steep reduction
of the write noise margin. Therefore, switching the supply from 1.2 V to 0.9 V is safe for
SRAM read operations. Furthermore, when the supply voltage drops to 0.4 V, the hold
noise margin is almost equal to the read noise margin at a supply voltage of 1.2 V, which
implies that lowering the supply voltage to 0.4 V while the SRAM holds its data is realistic.

Figure 3.13 Static noise margin vs power supply

3.4 Inverter Ring Simulation
In order to study the whole system, we also build a fanout-of-4 (FO4) inverter ring [16,17]
in Cadence to estimate the dynamic and leakage power of the core. The FO4 inverter
delay is a standard technology benchmark used to predict the delay of more complex circuits.
In our experiment, we use the FO4 inverter ring to simulate the behavior of the core and
estimate its nominal dynamic and leakage power. Figure 3.14 shows the circuit of the
five-stage FO4 inverter ring we build in Cadence. Our simulation results indicate that the
inverter ring is a useful proxy for the power consumption trend of digital logic.

Figure 3.14 Circuit for inverter ring simulation

The relationship between normalized power consumption and supply voltage obtained
from the inverter ring simulation is shown in Figure 3.15 and Figure 3.16. It is similar to
the Cacti results (especially those of the TAG RAM):
1) As voltage decreases from 1.2 V to 0.9 V (Figure 3.15):
• Dynamic power consumption decreases by more than 40%
• Leakage power consumption decreases by more than 60%
2) As voltage decreases from 1.2 V to 0.4 V (Figure 3.16):
• Leakage power consumption decreases by more than 90%
Besides the five-stage inverter ring, three-stage and seven-stage inverter rings are
simulated as well, and their results show trends similar to Figure 3.15 and Figure 3.16.
Therefore, the inverter ring results further support our proposed idea.
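As a rough sanity check on the dynamic power trend, the first-order CMOS relation, in which dynamic power scales with C * Vdd^2 * f, can be evaluated for the 1.2 V to 0.9 V step. The sketch below assumes that capacitance, activity, and frequency stay constant, which is a simplification and not a statement about the Cadence setup; the function name is illustrative. Leakage power does not follow this relation and is not modeled here.

```python
def dynamic_power_reduction(v_high, v_low):
    """Fractional dynamic power saving when Vdd drops from v_high to v_low,
    assuming P_dyn ~ C * Vdd^2 * f with C and f held constant (first-order model)."""
    return 1.0 - (v_low / v_high) ** 2

print(f"{dynamic_power_reduction(1.2, 0.9):.2%}")  # 43.75%
```

Under these assumptions the quadratic dependence alone accounts for a reduction slightly above 40%, in line with the simulated trend.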

Figure 3.15 Power reduction ratio when voltage decreases from 1.2 V to 0.9V

Figure 3.16 Power reduction ratio when voltage decreases from 1.2 V to 0.4V

3.5 Power Switches Implementation
Having proven the feasibility of the new method, we first discuss in this section the
finite state machine of the cache operation and determine when to switch the power
supplies. Then we present our power switch logic and system architecture in detail.
Finally, we discuss the data path of our control logic, which we have built in Verilog
HDL and verified in VCS.

3.5.1 Cache Operation Finite State Machine
As shown in Figure 2.5, during a cache hit the lower level cache is idle; its read and
write operations are only activated when a cache miss occurs. Therefore, the lower level
cache can operate at the hold voltage V3 when there is no cache miss. Our simulation
results show a hit rate of over 50% for the L1 cache and over 90% for the L2 cache, which
implies long consecutive idle periods. Furthermore, when a cache miss or invalidation
occurs, the lower level cache is read, and the power supply voltage should be V2. If the
cache misses and the line is dirty, the write operation of the lower level cache is activated,
which means the cache must work at V1, the original rated voltage given by the manual.
In addition, on a cache miss the cache is normally operated on a whole cache line at a
time, which is essentially a block operation consisting of long consecutive reads or writes.
It is therefore suitable to switch the power supply to V2 during the read operation and to
V1 during the write operation.
Moreover, the TAG RAM and the DATA RAM behave quite differently: when there is no
cache miss, the only operation on the TAG RAM is a read, which is not the case for the
DATA RAM. Another fact is that the TAG RAM consumes more dynamic power than the
DATA RAM, because the TAG RAM is accessed in parallel during each read/write
operation to shorten the access delay, while only the index-matched cache lines of the
DATA RAM are activated. The DATA RAM, however, consumes much more leakage
power because it is much larger than the TAG RAM. Therefore, we need to consider the
TAG RAM and DATA RAM separately, and to study the dynamic and leakage power of
each.

To sum up, the finite state machine of the cache operation shown in Figure 2.5 tells us
exactly when the cache is idle and when it performs read/write operations. We can
therefore build our power switch logic and analyze the dynamic and leakage power for
each RAM in each cache layer. In the next chapter, we evaluate the results obtained under
these assumptions.

3.5.2 Power Switch Logic
The contents of the TAG RAM are only updated when a cache miss occurs. During a
cache miss, the TAG RAM needs to be written, so the supply voltage must be V1.
However, if a miss occurs in the upper layer cache but not in the current layer, the TAG
RAM is only read, so the supply voltage should be V2. For the rest of the time, the TAG
RAM is idle and is powered by V3. The resulting power switch logic of the TAG RAM is
shown in Figure 3.17.

Cache miss -> Switch to V1
Upper layer cache miss -> Switch to V2

Figure 3.17 Power switch logic of TAG RAM

Cache miss || (Upper layer cache miss && Upper layer cache line dirty) -> Switch to V1
Upper layer cache miss -> Switch to V2

Figure 3.18 Power switch logic of DATA RAM

For the DATA RAM, the logic for power supply V2 is the same as for the TAG RAM,
because when an upper layer cache miss happens, the current layer cache needs to be
read. For the control logic of V1, however, besides a current layer cache miss, the DATA
RAM must also be updated when the upper layer cache misses and its cache line is dirty.
Therefore, the logic for the power supply V1 of the DATA RAM is:
Current layer cache miss || ((Upper layer cache miss) && (Upper layer cache dirty))
The final power supply control logic of the DATA RAM is shown in Figure 3.18.
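The selection rules for the two RAMs can be written down as a small Python sketch. The function and argument names are illustrative assumptions; the thesis implements this logic in Verilog HDL, which is not reproduced here.

```python
def tag_ram_supply(cache_miss: bool, upper_miss: bool) -> str:
    """Supply for the TAG RAM of the current cache layer."""
    if cache_miss:       # TAG RAM is written on a current-layer miss
        return "V1"
    if upper_miss:       # TAG RAM is only read when the upper layer misses
        return "V2"
    return "V3"          # idle: hold voltage

def data_ram_supply(cache_miss: bool, upper_miss: bool, upper_dirty: bool) -> str:
    """Supply for the DATA RAM of the current cache layer."""
    if cache_miss or (upper_miss and upper_dirty):  # DATA RAM is written
        return "V1"
    if upper_miss:                                  # DATA RAM is read
        return "V2"
    return "V3"                                     # idle: hold voltage

print(tag_ram_supply(False, True))         # V2
print(data_ram_supply(False, True, True))  # V1
```

Checking the write condition first means the higher supply always wins when several conditions hold at once.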

3.5.3 Architecture
Based on the logic discussed in Section 3.5.2, we build the complete functional block
of our power switch control system for L1, L2 and L3 (LLC: Last Level Cache), which is
shown in Figure 3.19. The annotations at the left and right margins of the figure describe
the power control logic of each power supply for each cache layer.
Since L1 is connected to the CPU, it is never idle. Because accesses to the L1 cache
DATA RAM are random, the DATA RAM is always on V1. The TAG RAM of the L1
cache, however, is normally on V2, because it is only written, and therefore only needs
V1, when a cache miss happens. It is never on V3 because the L1 cache is never idle.
In addition to the power control logic, our system considers the priority of the power
supplies. The priority of V1 is always higher than that of V2, and the priority of V2 is
higher than that of V3, because V1 can be used for all read/write/hold operations, V2 is
not suitable for write operations, and V3 is only suitable for the data hold stage. This
means that V3 can be replaced by V2 or V1, and V2 can be replaced by V1, but not the
other way around.

Remark:
V1: Higher voltage (can be used for write, read, and hold)
V2: Medium voltage (can be used for read and hold, but cannot be used for write)
V3: Lower voltage (can only be used for hold)
Priority: V1 > V2 > V3

L1 Tag RAM:   V1: L1 cache miss; V2: Normal
L1 Data RAM:  V1: Normal
L2 Tag RAM:   V1: L2 cache miss; V2: L1 cache miss; V3: Normal
L2 Data RAM:  V1: L2 cache miss || (L1 cache miss && L1 cache line dirty); V2: L1 cache miss; V3: Normal
L3 Tag RAM:   V1: L3 cache miss; V2: L2 cache miss; V3: Normal
L3 Data RAM:  V1: L3 cache miss || (L2 cache miss && L2 cache line dirty); V2: L2 cache miss; V3: Normal
Shared RAM:   V1: TLB miss || (L3 cache miss && L3 cache line dirty); V2: L3 cache miss; V3: Normal

Figure 3.19 Whole power switch control functional block

(Blocks shown: Valid Flag Array (2x2Kx1b), Tag Array (2x2Kx17b), Data Array (2x2Kx128b),
Dirty Flag Array (2x2Kx1b), Power Switch, and the cache and memory request/response ports.)

Figure 3.20 Data-path of 64K two-way associative cache including power switches

3.5.4 VCS Simulation
The logic discussed in Section 3.5.2 is not difficult to implement in Verilog HDL. Figure
3.20 shows the data path [9] of the 64K two-way associative cache with 16 bytes per line
that we simulated in VCS. Compared to the original data path, we add the power switch
control logic, which is marked by the red dashed rectangle in Figure 3.20. This also shows
that our design fits easily into an existing cache controller. In addition, the dashed blue
rectangle marks the separated TAG and DATA RAM in our cache controller.

3.6 Power Switches Circuit Design
After integrating our control logic with the power switch circuit proposed in Booster
[8], the final power switch system is shown in Figure 3.21. Whenever the condition for a
higher voltage is satisfied in the power switch logic above, the condition for the lower
voltage is satisfied as well, so we assign priorities to the power supplies:
V1 (Higher voltage) > V2 (Medium voltage) > V3 (Lower voltage)
(Figure content: the power supply control logic combines the "Force to use V1", "Switch to
V1", "Switch to V2", and "Switch to V3" signals into the V1 On, V2 On, and V3 On signals
that drive switches S1 to S3 of the power supply switch circuit [8]; the supply voltage
output feeds the SRAM equivalent circuit, modeled with R, L, and C.)

Figure 3.21 System diagram after integrating power switch circuit

To remain compatible with an existing system without power switches, in addition to the
power supply priority design, we provide a user-defined bit that forces the system to use
only V1. If this bit is set, the power supply switch system is bypassed and the system
works in the high-performance, high-power mode all the time.
With the logic above, only one power supply is selected for the supply output at any time.
The delay of the switches can leave all switches (S1~S3) open for a short time, but the
capacitance of the SRAM cells can maintain the supply voltage during such a short
interruption of the main power supply. The delay of the switches could also cause several
switches to be closed at the same time. This has no side effect on the system, except that
the power consumption does not decrease during this short time.
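One way to realize the priority V1 > V2 > V3 together with the force-to-V1 bit is a simple priority encoder that produces one-hot switch enables. The sketch below is an assumption about how such gating could look; the exact gate structure of Figure 3.21 is not recoverable from the text, and the signal names are illustrative.

```python
def switch_enables(force_v1: bool, want_v1: bool, want_v2: bool):
    """Return (v1_on, v2_on, v3_on) with exactly one supply enabled.

    force_v1: user-defined bit that bypasses the power switch system.
    want_v1 / want_v2: outputs of the per-RAM power switch logic.
    """
    v1_on = force_v1 or want_v1           # highest priority
    v2_on = (not v1_on) and want_v2       # only when V1 is not selected
    v3_on = not (v1_on or v2_on)          # default hold supply
    return v1_on, v2_on, v3_on

print(switch_enables(False, False, True))  # (False, True, False)
print(switch_enables(True, False, True))   # (True, False, False)
```

Because each lower supply is gated by the enables above it, the steady-state output is one-hot; the transient overlap or gap caused by switch delay is outside the scope of this sketch.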
For the power switch circuit, we adopt the design from the article "Booster: Reactive
Core Acceleration for Mitigating the Effects of Process Variation and Application
Imbalance in Low-Voltage Chips" [8]. The switching speed of this design is high enough
for our system. However, the delay of switching from a lower voltage to a higher one
adds to the cache miss penalty, because the pending operation must wait until the power
supply is stable. Switching from a higher voltage to a lower one has no such problem,
because operations still work properly at the higher voltage, only with more power
consumption. The cache effective access time [18,19,20] is calculated by:
$$\text{effective\_access\_time} = \text{cache\_access\_time} + \text{miss\_rate} \times \text{miss\_penalty}$$
The extra cache miss penalty therefore increases the effective access time, which implies
that the efficiency of cache accesses decreases. In a multi-tasking operating system,
however, when one thread or process encounters a cache miss, the operating system can
suspend the task and switch to another one until the resource of the suspended task is
ready, and then wake it up again. Because the L1 cache DATA RAM is always supplied
by V1, a short delay for the L2 and L3 caches will not affect the performance of the entire
system too much, even though this delay lowers the performance of single tasks. In this
study, we did not investigate the performance impact of the additional switching delay;
further exploration of this topic is needed in the future.
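The effect of a switching delay on the effective access time can be illustrated with a short Python sketch. All numbers below are hypothetical and chosen only to show the arithmetic; the thesis does not report measured switching delays, and the `switch_delay` parameter is an assumption that simply adds the delay to every miss.

```python
def effective_access_time(cache_access_time, miss_rate, miss_penalty, switch_delay=0.0):
    """Effective access time; switch_delay (assumed) is added to each miss penalty."""
    return cache_access_time + miss_rate * (miss_penalty + switch_delay)

# Hypothetical values in nanoseconds, for illustration only.
base = effective_access_time(2.0, 0.10, 20.0)
with_switch = effective_access_time(2.0, 0.10, 20.0, switch_delay=5.0)
print(base, with_switch)  # 4.0 4.5
```

The added cost scales with the miss rate, so a given switching delay matters less for caches with high hit rates.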
Furthermore, our power system provides very flexible interfaces. If three power supplies
are available, V1, V2, and V3 each connect to a separate supply. If only two power
supplies are available, we can connect either V1 and V2 or V2 and V3 together to one of
them. In the worst case, V1, V2, and V3 can all be connected to a single power supply.

3.7 Gem5 Simulation
To evaluate the duration ratio of each power supply, we use the Gem5 simulator [21] to
analyze the instruction trace and the cache miss/hit rates, and then estimate the duration
ratio of each power supply under the SPEC benchmarks. The Gem5 simulator is a modular
platform for computer-system architecture research, encompassing system-level
architecture as well as processor microarchitecture.
Using Gem5, we model a single-core processor with a 32K-byte L1 data cache, a 32K-byte
L1 instruction cache, a 256K-byte L2 cache and a 2M-byte L3 cache (Appendix B).
All the caches are 4-way associative with 64 bytes per line. After running the SPEC
benchmarks, Gem5 records a detailed instruction trace and the cache miss/hit rate for each
level of cache. The Gem5 results show an L1 cache hit rate of more than 50% and L2 and
L3 cache hit rates of more than 90% on average. Then, according to the logic discussed in
Section 3.5, we estimate the average duration ratio of each power supply for the SPEC
benchmarks (without considering the power switching delay). Figures 3.22 and 3.23 show
the average duration ratios for the SPEC benchmarks we ran on Gem5.
Knowing the duration ratio of each power supply and its power saving ratio relative to the
rated power supply V1, we can estimate the total power saving for each cache layer,
which is discussed in detail in the next chapter.

Figure 3.22 Average power duration ratio for dynamic power of SPEC benchmark

Figure 3.23 Average power duration ratio for leakage power of SPEC benchmark

Chapter 4: Result Evaluation
In this chapter, we evaluate our proposed method using the simulations conducted in
Chapter 3.

4.1 Dynamic/Leakage Power Ratio
Since dynamic power dominates the power consumption during write/read operations,
while leakage power dominates during the idle state, we first break down the power of
each cache level into dynamic and leakage power [22]. Table 4.1 shows the average power
breakdown when the SPEC benchmarks run on Gem5. Figure 4.1 shows the corresponding
bar chart.
Table 4.1 Dynamic/Leakage power breakdown

Cache Type    Power Ratio
L1            Dynamic (70%), Leakage (30%)
L2            Dynamic (20%), Leakage (80%)
L3            Dynamic (10%), Leakage (90%)

Figure 4.1 Dynamic/Leakage power breakdown

Because every circuit in the cache consumes leakage power, the shares of leakage power
consumed by the TAG RAM and the DATA RAM can be estimated by:
$$\frac{\text{TAG ARRAY BITS}}{\text{TAG ARRAY BITS} + \text{DATA ARRAY BITS}}$$
$$\frac{\text{DATA ARRAY BITS}}{\text{TAG ARRAY BITS} + \text{DATA ARRAY BITS}}$$
During a read or write operation, however, the entire TAG RAM is accessed in parallel
[23], while only the index-matched cache line of the DATA RAM is activated. Thus, the
shares of dynamic power consumed by the TAG RAM and the DATA RAM are about:
$$\frac{\text{TAG ARRAY BITS}}{\text{TAG ARRAY BITS} + \text{DATA ARRAY BITS in ONE CACHE LINE} \times \text{WAYS of CACHE}}$$
$$\frac{\text{DATA ARRAY BITS in ONE CACHE LINE} \times \text{WAYS of CACHE}}{\text{TAG ARRAY BITS} + \text{DATA ARRAY BITS in ONE CACHE LINE} \times \text{WAYS of CACHE}}$$
According to the equations above, we list the ratio of the TAG RAM to the total cache
RAM and to the RAM bits accessed during one write/read operation for 4-way associative
caches of different sizes, with 32 bytes per line in Table 4.2 and 64 bytes per line in
Table 4.3. Based on Tables 4.2 and 4.3, and considering the power consumption of
peripheral circuits such as the pre-charge circuit, sense amplifier, write driver, and
comparison circuit, we assume that the TAG RAM accounts for about 80% of the total
dynamic power, while the DATA RAM consumes 90% of the total leakage power. The
resulting breakdown of power down to the TAG/DATA RAM level is shown in Table 4.4
and Figure 4.2.
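The two TAG ratios tabulated in Tables 4.2 and 4.3 can be reproduced with a short Python sketch. It assumes a 32-bit address, which is inferred from the tag widths listed in the tables rather than stated in the text, and the function name is illustrative.

```python
def tag_ratios(cache_bytes, line_bytes, ways, addr_bits=32):
    """Return (tag width in bits, TAG share of total RAM, TAG share of bits per access).

    Assumes power-of-two sizes and a 32-bit address (inferred, not stated in the text).
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size in bytes
    tag_width = addr_bits - index_bits - offset_bits
    tag_bits = tag_width * lines                # entire TAG array
    data_bits = cache_bytes * 8                 # entire DATA array
    accessed_data_bits = line_bytes * 8 * ways  # one line in each way
    storage_ratio = tag_bits / (tag_bits + data_bits)
    access_ratio = tag_bits / (tag_bits + accessed_data_bits)
    return tag_width, storage_ratio, access_ratio

width, storage, access = tag_ratios(16 * 1024, 32, 4)
print(f"{width} bits, {storage:.2%}, {access:.2%}")  # 20 bits, 7.25%, 90.91%
```

The example corresponds to the 16K-byte, 32-bytes-per-line row: the TAG array is a small fraction of the stored bits but the large majority of the bits touched per access.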

Table 4.2 TAG RAM ratio of 4 way-associative 32 bytes per line cache

SIZE of        SIZE of           TAG ratio of    TAG access ratio
DATA RAM       TAG RAM           Total RAM       per one operation
16K Bytes      20 x 512 Bits     7.25%           90.91%
32K Bytes      19 x 1K Bits      6.91%           95.00%
64K Bytes      18 x 2K Bits      6.57%           97.30%
128K Bytes     17 x 4K Bits      6.23%           98.55%
1M Bytes       14 x 32K Bits     5.19%           99.78%
2M Bytes       13 x 64K Bits     4.83%           99.88%
4M Bytes       12 x 128K Bits    4.48%           99.93%
8M Bytes       11 x 256K Bits    4.12%           99.96%

Table 4.3 TAG RAM ratio of 4 way-associative 64 bytes per line cache

SIZE of        SIZE of           TAG ratio of    TAG access ratio
DATA RAM       TAG RAM           Total RAM       per one operation
32K Bytes      19 x 512 Bits     3.58%           82.61%
64K Bytes      18 x 1K Bits      3.40%           90.00%
128K Bytes     17 x 2K Bits      3.21%           97.14%
1M Bytes       14 x 16K Bits     2.66%           99.12%
2M Bytes       13 x 32K Bits     2.48%           99.88%
4M Bytes       12 x 64K Bits     2.29%           99.74%
8M Bytes       11 x 128K Bits    2.10%           99.86%

Table 4.4 Power breakdown to TAG and DATA RAM

Cache Type    Power Ratio       Power per RAM
L1            Dynamic (70%)     TAG (80%), DATA (20%)
              Leakage (30%)     TAG (10%), DATA (90%)
L2            Dynamic (20%)     TAG (80%), DATA (20%)
              Leakage (80%)     TAG (10%), DATA (90%)
L3            Dynamic (10%)     TAG (80%), DATA (20%)
              Leakage (90%)     TAG (10%), DATA (90%)

Figure 4.2 Power breakdown to TAG and DATA RAM

4.2 Power Savings Ratio per Cache
Knowing the dynamic/leakage power ratio, the power breakdown ratio per RAM, the
duration ratio of every voltage supply, and the power saving ratio of each voltage supply,
we can now calculate the power saving ratio of each cache by the following equation:

$$\sum R_{\text{dyn/leak}} \times R_{\text{power per ram}} \times R_{\text{voltage duration}} \times R_{\text{power saving of voltage}}$$

Using the above equation, we obtain the power saving ratio of each cache shown in
Table 4.5. From the table, we conclude that, after applying our power switch system to
the cache controller, about 12% of the L1 cache power consumption can be saved,
compared with 56% for the L2 cache and 80% for the L3 cache. Table 4.5 thus shows
that our power switch strategy is effective for the L2 and L3 caches, while it saves much
less power for the L1 cache.
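As a worked instance of the summation, the following Python sketch recomputes the L2 cache figure from the ratios reported in Table 4.5. The data layout and function name are illustrative choices; the numbers are taken from the table.

```python
def cache_power_saving(rows):
    """Sum of power_ratio * ram_ratio * duration * saving over all RAMs and supplies.

    rows: list of (power_ratio, ram_ratio, [(duration_ratio, saving_ratio), ...]).
    """
    return sum(p * r * sum(d * s for d, s in supplies) for p, r, supplies in rows)

l2_rows = [
    (0.20, 0.80, [(0.90, 0.40)]),                 # dynamic, TAG: V2 only
    (0.20, 0.20, [(0.65, 0.30)]),                 # dynamic, DATA: V2 only
    (0.80, 0.10, [(0.45, 0.60), (0.50, 0.90)]),   # leakage, TAG: V2 and V3
    (0.80, 0.90, [(0.325, 0.50), (0.50, 0.90)]),  # leakage, DATA: V2 and V3
]
print(f"{cache_power_saving(l2_rows):.2%}")  # 56.40%
```

Time spent on V1 contributes nothing to the sum, since V1 is the reference supply with a saving ratio of 0%.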

Table 4.5 Power saving ratio of every cache

Cache  Power          Power per   Duration Ratio   Saving Ratio     Power Savings
Type   Ratio          RAM         V2      V3       V2      V3       Per RAM  Subtotal  Total
L1     Dynamic (70%)  Tag (80%)   50.0%   0.0%     40.0%   --       11.20%   11.20%    12.10%
                      Data (20%)  0.0%    0.0%     30.0%   --       0.00%
       Leakage (30%)  Tag (10%)   50.0%   0.0%     60.0%   90.0%    0.90%    0.90%
                      Data (90%)  0.0%    0.0%     50.0%   90.0%    0.00%
L2     Dynamic (20%)  Tag (80%)   90.0%   0.0%     40.0%   --       5.76%    6.54%     56.40%
                      Data (20%)  65.0%   0.0%     30.0%   --       0.78%
       Leakage (80%)  Tag (10%)   45.0%   50.0%    60.0%   90.0%    5.76%    49.86%
                      Data (90%)  32.5%   50.0%    50.0%   90.0%    44.10%
L3     Dynamic (10%)  Tag (80%)   90.0%   0.0%     40.0%   --       2.88%    3.39%     80.22%
                      Data (20%)  85.0%   0.0%     30.0%   --       0.51%
       Leakage (90%)  Tag (10%)   9.0%    90.0%    60.0%   90.0%    7.78%    76.83%
                      Data (90%)  8.5%    90.0%    50.0%   90.0%    69.05%

4.3 Power Savings Ratio to CPU
After scaling the power saving ratio of each cache level in Table 4.5 to the entire
CPU, we get the CPU power saving numbers listed in Table 4.6.
Table 4.6 Power saving ratio to the CPU

Power Ratio (To the whole CPU)      Power Saving Ratio
L1         L2         L3            (To the whole CPU)
16.00%     --         --            1.94%
13.10%     22.00%     --            13.99%
12.00%     20.00%     10.00%        20.75%

Table 4.6 shows that if the cache consumes 16% of the total processor power and there is
no L2 or L3 cache in the processor, our power switch system saves only about 2% of the
CPU power, which is a modest result. If an L2 cache is present, our power switch strategy
can save up to 14% of the CPU power. Furthermore, if L1, L2, and L3 are all on-chip,
about 20% of the total CPU power can be saved by applying our power switch system.
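The scaling from per-cache savings to CPU-level savings in Table 4.6 is a weighted sum, sketched below in Python. The per-cache saving ratios come from Table 4.5 and the cache power shares from Table 4.6; the names are illustrative.

```python
# Per-cache power saving ratios from Table 4.5.
CACHE_SAVING = {"L1": 0.1210, "L2": 0.5640, "L3": 0.8022}

def cpu_power_saving(cache_share):
    """cache_share maps a cache level to its share of total CPU power."""
    return sum(share * CACHE_SAVING[level] for level, share in cache_share.items())

print(f"{cpu_power_saving({'L1': 0.16}):.2%}")                          # 1.94%
print(f"{cpu_power_saving({'L1': 0.131, 'L2': 0.22}):.2%}")             # 13.99%
print(f"{cpu_power_saving({'L1': 0.12, 'L2': 0.20, 'L3': 0.10}):.2%}")  # 20.75%
```

Most of the CPU-level saving comes from the L2 and L3 terms, which combine a large power share with a high per-cache saving ratio.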

4.5 Power Savings of Benchmarks by Gem5
Following the calculations above, we estimate the power consumption and power saving
ratio by running the SPEC benchmarks on Gem5 with a 32K-byte L1 data cache, a
32K-byte L1 instruction cache, a 256K-byte L2 cache and a 2M-byte L3 cache. All the
caches are 4-way associative with 64 bytes per line in our Gem5 simulation. The
simulation results are shown in Figure 4.3. In each bar group, the left column shows the
estimated breakdown of CPU power consumption before applying our power switch
logic, and the right column shows the result after our power switch logic is applied. The
red arrows show the power saving relative to the entire CPU power consumption after
our power switch system is applied. The results show that more than 20% of the CPU
power can be saved for all the SPEC benchmarks we ran on Gem5.

Figure 4.3 Power breakdown of benchmark by Gem5

Chapter 5: Future Work
Our analysis shows promising results for the power switch system we designed. However,
our research could go deeper.
Firstly, our discussion in this thesis treats the cache as a whole, but a typical cache
consists of banks and sets. During a read or write operation, only the matched banks
and sets are activated. Thus, we could apply our switches at the cache bank or cache set
level.
Secondly, we have not considered in detail the power consumption of the peripheral
circuits of the cache, such as the comparison circuit, the pre-charge circuit, and so on.
Since these peripheral circuits are a large part of the cache, a deeper study of them might
lead to further power savings.
Thirdly, we have not taken into consideration the effect of peripherals such as direct
memory access (DMA) in our system. Since the DMA accesses the memory directly, the
cache controller might need to act at times other than cache misses in order to keep the
DMA buffer and the cache coherent.
Fourthly, our research is based on the power switch design in Booster [8], and the
delay of the power switch and the power consumed by the switch itself need to be
studied further.
Finally, our simulations so far are based on a single-core processor, and the multi-core
case needs further consideration.

Chapter 6: Conclusions
In this thesis, we present a method to reduce cache power consumption using multiple
power supplies, which lets caches work at lower voltages as much as possible. We then
discuss the simulations and calculations we conducted to evaluate our method. We
built an SRAM cell in Cadence to find proper operating voltages and to verify that the
read operation and idle state of the SRAM can be performed at lower supply voltages with
significantly reduced power consumption. We adjusted Cacti's code to systematically
simulate and calculate the read/write power consumption of caches with different sizes
and configurations at different power supplies. Then we built an inverter ring to simulate
the core of the processor and estimate the dynamic and leakage power consumption of
the core, which we compared to the Cacti cache simulation results. Finally, we used Gem5
to analyze the instruction trace and cache miss/hit rates under the SPEC benchmarks and
to estimate the duration ratio of each power supply and the power saving ratio of each
cache.
Even though our simulations show promising results, with more than 20% of CPU power
saved after applying our power switch system, many problems related to our idea remain
open and need to be studied in depth in the future.

References
[1] http://flicksoftware.com/services/iot/iot-rt-graphic/
[2] https://pursuitist.com/apple-rumors-2014-include-larger-ipad/
[3] http://www.cs.utah.edu/~bojnordi/class/02-cache.pdf
[4] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen and N. P. Jouppi,
“McPAT: An integrated power, area, and timing modeling framework for multicore and
manycore architectures,” MICRO, New York, NY, 2009, pp. 469-480.
[5] K. Ananda Vardhan, “Exploiting Critical Data Regions to Reduce Data Cache Energy
Consumption,” SCOPES, 2014
[6] Luis Villa, “Dynamic Zero Compression for Cache Energy Reduction,” MICRO, 2000
[7] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0201d/I21752.html
[8] T. N. Miller, “Booster: Reactive Core Acceleration for Mitigating the Effects of
Process Variation and Application Imbalance in Low-Voltage Chips,” HPCA, 2012
[9] http://www.csl.cornell.edu/courses/ece4750/handouts/ece4750-lab3-mem.pdf
[10] L. Chang et al., “Stable SRAM cell design for the 32 nm node and beyond,” Digest
of Technical Papers. 2005 Symposium on VLSI Technology, 2005., pp. 128-129.
[11] http://www.hpl.hp.com/research/cacti/
[12] A. Teman, “Dynamic stability and noise margins of SRAM arrays in nanoscaled
technologies,” 2014 IEEE Faible Tension Faible Consommation, Monaco, 2014, pp. 1-5.
[13] J. Wang, S. Nalam and B. H. Calhoun, “Analyzing static and dynamic write margin
for nanometer SRAMs,” ISLPED, Bangalore, 2008, pp. 129-134.
[14] E. Grossar, M. Stucchi, K. Maex and W. Dehaene, “Read Stability and Write-Ability
Analysis of SRAM Cells for Nanometer Technologies,” in IEEE Journal of Solid-State
Circuits, vol. 41, no. 11, pp. 2577-2588, Nov. 2006.
[15] Hiroshi Makino, “Improved Evaluation Method for the SRAM Cell Write Margin by
Word Line Voltage Acceleration,” Published Online July 2012.
[16] S. H. Tang et al., “FinFET-a quasi-planar double-gate MOSFET,” ISSCC (Cat.
No.01CH37177), San Francisco, CA, USA, 2001, pp. 118-119.
[17] T. Matsuda et al., “A combined test structure with ring oscillator and inverter chain
for evaluating optimum high-speed/low-power operation”, International Conference on
Microelectronic Test Structures, 2003., pp. 3-84
[18] https://www.d.umn.edu/~gshute/arch/cache-performance.xhtml
[19] http://ece-research.unm.edu/jimp/611/slides/chap5_2.html
[20] James Dundas, “Improving data cache performance by pre-executing instructions
under a cache miss,” ICS, 1997.
[21] http://gem5.org/Main_Page.
[22] M. Powell, “Reducing Leakage in a High-Performance Deep-Submicron Instruction
Cache,” VLSI, 2002.
[23] J. J. Valls, “The Tag Filter Cache: An Energy-Efficient Approach,” PDP, 2015
[24] http://gem5.org/Running_gem5

Appendix A
Polynomial Curve Fitting for Cacti
In the original code of Cacti, the power supply voltage and the corresponding
pMOS/nMOS currents are fixed. To meet the simulation requirements of our system, we
use a polynomial curve fitting function (polyfit) to model the relationship between Vdd
and the pMOS/nMOS currents. We first build the circuits shown in Figure A.1 to A.3 and
then feed the measured data into a 6th-order polyfit in Matlab. The polyfit functions we
used are shown in Figure A.4 to A.6. After obtaining the polyfit coefficients, we plot the
original data together with the fitted data. The results, shown in Figure A.7 to A.9,
indicate that the fitted curves almost perfectly overlap the curves generated from the
original data.
We then replace the fixed Vdd and currents in Cacti with the 6th-order polynomial
equations generated above. As a result, we can sweep Vdd in Cacti. The polynomial
equations work properly in Cacti at voltages greater than 0.5 V. When Vdd drops below
0.5 V, however, small negative values appear in the results of the polynomial equations.
Since Cacti cannot handle negative numbers, our Cacti simulation stops at 0.6 V, which
ensures that Cacti works correctly.

Figure A.1 Circuit to measure leakage current of nMOS on

Figure A.2 Circuit to measure leakage current of nMOS off

Figure A.3 Circuit to measure leakage current pMOS on

Figure A.4 Polyfit function of nMOS on

Figure A.5 Polyfit function of nMOS off

Figure A.6 Polyfit function of pMOS on

Figure A.7 Polyfit vs original curve of nMOS on

Figure A.8 Polyfit vs original curve of nMOS off

Figure A.9 Polyfit vs original curve of pMOS on

Appendix B
Gem5 Simulator Usage
B.1 Gem5 Compiling
To compile Gem5, you first need to install all the dependency packages listed in [21].
Then, in the Gem5 root directory, run the following command:
scons ./build/X86/gem5.opt
The intermediate files are cached after the first successful build. To rebuild the whole
system from scratch, delete the ./build folder.

B.2 Run Benchmarks on Gem5 [24]
We use one of the SPEC benchmarks, art, as an example to explain how to run the
benchmarks on Gem5 in System Call Emulation mode. The complete command is:
./build/X86/gem5.opt --outdir=./spec configs/example/se.py --cpu-type=detailed --caches \
--l1d_size=32kB --l1d_assoc=4 --l1i_size=32kB --l1i_assoc=4 --l2cache --l2_size=256kB \
--l2_assoc=4 --l3cache --l3_size=2MB --l3_assoc=4 --mem-size=2GB -I 10000000000 \
-c /spec2000_x86/spec2000/benchspec/CFP2000/179.art/src/art \
-o '-scanfile /spec2000_x86/spec2000/benchspec/CFP2000/179.art/src/c756hel.in -trainfile1 /spec2000_x86/spec2000/benchspec/CFP2000/179.art/src/a10.img -trainfile2 /spec2000_x86/spec2000/benchspec/CFP2000/179.art/src/hc.img -stride 2 -startx 110 -starty 200 -endx 160 -endy 240 -objects 10'
The cache-related parameters (--caches through --l3_assoc) are the cache configurations
we used for our simulation. The -I parameter specifies the maximum number of
instructions to execute. The statistics are stored in the folder specified by the --outdir
parameter.
Gem5 also supports a full system mode. This mode simulates a complete system and
provides an operating-system-based simulation environment. To run benchmarks in
full system mode, we first need to build the system image, which includes the binaries of
the benchmarks, and then run the benchmark through the .rcS file specified by the --script
parameter. The complete command is:
./build/X86/gem5.opt configs/example/fs.py --kernel=x86_64-vmlinux-2.6.22.9.smp \
--script=/Gem5RcsScript/fluidanimate_4c_simmedium_ckpts.rcS --caches --l1d_size=32kB \
--l1i_size=32kB --l2cache --l2_size=256kB --l3_size=2048kB \
--disk-image=x86root-parsec-new.img --mem-size=2GB


Cache Power Optimization,

Yan, M.S. 2017


