University of Texas at El Paso

ScholarWorks@UTEP
Open Access Theses & Dissertations
2022-08-01

Investigating the Effects of Decoupling Cache and Core Speed on
Power, Throughput, and Energy
David Daniel Pruitt
University of Texas at El Paso

Recommended Citation
Pruitt, David Daniel, "Investigating the Effects of Decoupling Cache and Core Speed on Power, Throughput,
and Energy" (2022). Open Access Theses & Dissertations. 3621.
https://scholarworks.utep.edu/open_etd/3621

INVESTIGATING THE EFFECTS OF DECOUPLING CACHE AND CORE SPEED ON
POWER, THROUGHPUT, AND ENERGY CONSUMPTION

DAVID DANIEL PRUITT
Doctoral Program in Computer Science

APPROVED:

Eric Freudenthal, Ph.D., Chair

Shirley Moore, Ph.D.

Art Duval, Ph.D.

Luc Longpre, Ph.D.

Stephen L. Crites, Jr., Ph.D.
Dean of the Graduate School

Copyright ©

by
David Pruitt
2022

Dedication
To the friends that made this process possible

“It is possible to commit no errors and still lose. That is not a weakness. That is life.”

INVESTIGATING THE EFFECTS OF DECOUPLING CACHE AND CORE SPEED ON
POWER, THROUGHPUT, AND ENERGY CONSUMPTION
OF CPUS

by

David Daniel Pruitt BS CS MS CS

DISSERTATION

Presented to the Faculty of the Graduate School of
The University of Texas at El Paso
in Partial Fulfillment
of the Requirements
for the Degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science
THE UNIVERSITY OF TEXAS AT EL PASO
August 2022

Acknowledgements
I would like to thank everyone that assisted during my long and arduous journey in completing
this project. I’m sure I’m missing some names here. I would like to thank the various members of
the Robust Autonomic systems group for their support over the years. Gabe Arellano for sharing
writing tips. Edward Dragone and Edward Hudgins for their help in various technical areas.
Adrian Veliz for forging the path and advising of the landmines that appear. Daniel Cervantes for
allowing me to bounce ideas off him and his support when I was trying to teach. I would also
like to thank the unofficial members, David Reyes and Oscar Veliz for answering the stupid
questions I had with the appropriate vigor.
I would also like to thank Eric Freudenthal and Salamah Salamah for providing assistance, useful
advice, and direction when needed that made this journey possible. I want to thank Dr. Shirley
Moore for working through ideas, spotting problems, and providing helpful suggestions. I want
to thank Dr. Art Duval and Luc Longpre for taking the time out of their busy schedules to read
over drafts and provide feedback.
“You ever figure procrastination is your brain’s way of stopping you from making a terrible
mistake? Yeah... Me too.” Cayde-6

Abstract
A variety of computer systems from HPC to mobile systems are power limited and
performance sensitive. These systems use very similar components at different scales. Dynamic
Voltage and Frequency Scaling (DVFS) features enable modulation of CPU performance and
efficiency characteristics to match power, energy, and timing requirements.
Programs have a variety of computational characteristics. If a CPU subsystem substantially
limits a particular program’s execution progress, that program’s throughput will vary
proportionally with the subsystem’s clock frequency. In contrast, if a CPU subsystem does not
substantially limit throughput, a change in its clock frequency will result in a
de minimis change in a program’s execution time.
Dynamic Voltage and Frequency Scaling (DVFS) power domains commonly encompass
entire cores and their associated caches. This work indicates that moderate energy efficiency gains
may be attainable for some programs if limiting and non-limiting subsystems’ (D)VFS domains
are decoupled. This decoupling enables tuning of their relative performance to application
characteristics.

Widely used simulation and modeling tools were extended to support this exploratory research.

Table of Contents
Dedication
Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures
List of Equations
Chapter 1: Introduction
1.1 Motivation
1.2 Strategy
1.3 Intuition
1.4 Results
1.5 Contributions
Chapter 2: CPU Power and Energy Consumption
2.1 Introduction
2.2 Relationship of Frequency, Throughput, and Efficiency
2.3 Device Power Consumption
2.3.1 Active Power
2.3.2 Background Power
2.5 Active Power Proportion
2.6 Power Domains and Decoupling
2.7 Other Approaches
Chapter 3: Methodology
3.1 Overview
3.2 Benchmark Kernels
3.3 Hardware Event Measurement
3.4 Power Modelling
3.4.1 Walker’s Model
3.4.2 Event Selection
3.4.4 Validation Study
3.5 Gem5
Gem5 Validation
Gem5 Modifications to Support Decoupling
Chapter 4: Experimental Results and Analysis
4.1 Introduction
4.2 L2 Limited Study
4.3 Core Limited Study
4.4 Balanced Study
4.5 Synopsis and Potential Extensions of this Work
4.5.1 Open Questions
4.5.1.1 Slower Clock Frequencies
4.5.1.2 Decoupling Additional Subsystems
4.5.1.3 Dynamic Optimization
4.5.1.4 Thread Interactions
4.5.1.5 Speculative Execution
References
Appendix A Benchmarks
Appendix B Gem5 Modifications
Appendix C Power Measurement
Vita

List of Tables
Table 3.1 Benchmark Kernels
Table 3.2 Power Model Events and Coefficients
Table 3.3 Power Model Validation Results
Table 3.4 Gem5 CPU Model Differences
Table 3.5 Gem5 Memory Model Differences
Table 3.6 L2 Cache Performance Comparison, TX1-A model vs Actual
Table 3.7 L2 Cache Performance Comparison, TX1-Final model vs Actual
Table 4.1 Most Efficient Configuration and Energy Summary
Table 4.2 Frequency Notation
Table 4.3 Benchmark Speedup
Table A.1 Benchmark Sources
Table A.2 Benchmark Configurations

List of Figures
Figure 3.1 Methodology
Figure 4.1 Power and Throughput of L2 Applications
Figure 4.2 Energy Breakdown of L2 Applications
Figure 4.3 L2 Energy Contour Plot L2 Applications Unicore
Figure 4.4 L2 Energy Contour Plot L2 Applications Multicore
Figure 4.5 Power and Throughput of Core Applications
Figure 4.6 Energy Breakdown of Core Applications
Figure 4.7 Energy Contour Plot Core Applications Unicore
Figure 4.8 Energy Contour Plot Core Applications Multicore
Figure 4.9 Power and Throughput of Balanced Applications
Figure 4.10 Energy Breakdown of Balanced Applications
Figure 4.11 Energy Contour Plot Balanced Applications Unicore
Figure 4.12 Energy Contour Plot Balanced Applications Multicore

List of Equations
Equation 3.1 Power Consumption Model

Chapter 1: Introduction
1.1 MOTIVATION
Modern computer systems from HPC to mobile systems are power limited and
performance sensitive. These systems use very similar components at different scales [1].
Furthermore, application-specific CPUs may be optimized for energy efficiency. For
example, CPUs and their subsystems can be selected and clocked to provide required performance.
Even instruction sets are optimized to satisfy embedded systems’ power, performance, and energy
requirements.
(D)VFS features enable modulation of CPU performance and efficiency characteristics to
better match system power and energy limitations, and application constraints. Higher frequency
provides greater throughput at the cost of increased power consumption. When sufficient
parallelism is exposed by the application and available from a system, optimization of efficiency
via DVFS can also maximize throughput. Energy consumed by computing components is
classified as either active or background. Active energy consumption is due to gate state changes
on the computational data or control path. Background energy is consumed by leakage current
and activity required to keep the system “alive.”
The throughput of a CPU-limited computation is limited by the rate of gate transitions.
The energy required for a gate transition activity is monotonic with voltage. Voltage is monotonic
with frequency. Therefore, energy required to actively compute a result is monotonic with
frequency. Therefore, if all energy went to active gate transitions, a computation would be
most efficient at the lowest available frequency.
Background energy consumption is the integral of background power consumption over
the time required to complete a computation. Background power varies only 20% over the range
of frequencies examined. As a result, background energy consumption increases when execution
time increases.
Execution throughput varies monotonically with frequency. If a subsystem substantially
limits execution progress, throughput can vary proportionally with frequency, resulting in almost
inversely proportional change in execution time and background energy consumption. In contrast,
if a subsystem does not substantially limit throughput, a change in its clock frequency
will result in a de minimis change in execution time. The change in background energy
consumption will be dominated by the resulting change in background power draw.

1.2 STRATEGY
DVFS power domains commonly encompass entire cores and their associated caches. This
work examines the extent to which some apps can be executed more efficiently if the limiting and non-limiting components are independently clocked.
The independent clocking of L2 and core allows these systems’ performance characteristics
to be tuned to match an app’s behavior.

1.3 INTUITION
The intuition motivating this strategy of independently clocking L2 and core is that (1)
some apps’ throughput is relatively insensitive to either core or L2 delays, and that (2) total
energy consumption can be reduced by reallocating power away from that subsystem without
substantially increasing execution time.
Power consumption and throughput are monotonic with frequency, and energy consumption
is the integral of power over time. This reallocation of power can potentially enable the limiting
subsystem to execute more quickly, thus hastening the app’s completion. Alternatively, this
strategy can reduce the power required of an execution of (approximately) the same duration.
This approach of independently clocking subsystems can potentially be applied to other functional
units.
To examine this phenomenon, widely used performance simulation and power modeling
tools were modified and calibrated to support this independent clocking.
To facilitate comparison over a variety of architectural parameters and algorithms, we
define a unit-independent energy efficiency ratio (EER). This ratio is normalized to system
efficiency at the most efficient configuration we examined. For example, if an application has an
EER of 1.2 for a particular (L2 frequency, Core Frequency) pairing, it is 20% less efficient than
the most efficient pairing examined in this study.
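
The EER normalization described above can be sketched as follows. This is a minimal illustration only: the energy values and (L2 frequency, core frequency) pairings are hypothetical and are not results from this study, and the function name is introduced here for convenience.

```python
# Hypothetical energies in joules keyed by (L2 MHz, core MHz); illustrative only.
energy_j = {
    (500, 1000): 12.0,
    (1000, 1000): 10.0,
    (1000, 1500): 11.0,
}


def energy_efficiency_ratios(energy_by_config):
    """Normalize each configuration's energy to the lowest energy examined."""
    best = min(energy_by_config.values())
    return {cfg: energy / best for cfg, energy in energy_by_config.items()}


# The most efficient pairing has an EER of 1.0; larger values are less efficient.
eer = energy_efficiency_ratios(energy_j)
```

Under these assumed values, the (500, 1000) pairing has an EER of about 1.2, meaning it consumes roughly 20% more energy than the best pairing in the set.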
Performance characteristics and EERs are examined for a variety of computational
benchmarks and scientific proxy apps. Apps are classified by primary throughput limitations over
the range of investigated clock frequencies.

1.4 RESULTS
Our results show that decoupling L2 cache from core speeds increases efficiency up to 10%
for some configurations. L2 limited apps show a 4%-10% improvement in efficiency with
decoupled clocks. Core-limited applications are at most 2% more efficient with decoupled clocks, and some
apps are most efficient at coupled speeds.

All apps are most efficient when active power comprises 15 to 30% of total energy
consumption. When active power is below this range, efficiency is gained when a moderate
increase in active power yields a substantial increase in throughput. When active power is above this
range, efficiency is increased when active power is substantially reduced, ideally in
subsystems with minimal limitation of throughput, sometimes to very low frequencies.
CPU cores consist of multiple functional units that could be clocked independently for
additional energy savings. Frequencies might be automatically tuned to provide best efficiency
using event counters and power measurements.
1.5 CONTRIBUTIONS
The primary contribution of this research is the characterization of computational
efficiency effects of independently varying the clock frequencies of core and L2 caches in an
energy-efficient superscalar system. Extensions of Walker’s power-modeling algorithm and
the widely used Gem5 simulation tools were developed to support systems with decoupled core
and cache frequencies. This effort also included the identification of efficiency and power
consumption metrics suitable to supporting this analysis and a systematic strategy for identifying
throughput-limiting subsystems.

Chapter 2: CPU Power and Energy Consumption

2.1 INTRODUCTION
We define efficiency as the amount of energy used during the execution of a program.
Thus, a more efficient system uses less energy per computation and has a lower efficiency metric
(lower is better).
The energy consumed by a computation is the integral of power over that computation’s
execution. Electrical power is the product of current and voltage, varies with workload, and is
frequently limited by device heat dissipation limits. We define throughput to be the reciprocal of
a program’s execution time. CPUs are synchronous devices whose throughput and power
consumption increase monotonically with clock frequency.
This section examines nuances in the relationship between clock frequency and total
system energy consumption over a computation’s lifetime relevant to this research.

2.2 RELATIONSHIP OF FREQUENCY, THROUGHPUT, AND EFFICIENCY
Because a CPU is a synchronous system, the throughput of a computation is proportional to
clock frequency if that throughput depends solely on CPU delays.
CPUs interact with other subsystems such as memory whose throughput and latencies
generally cannot be modulated with CPU frequency. Those limits frequently constrain a system’s
effective computational throughput, especially at high CPU clock frequencies. This is a well-known
phenomenon and has been the subject of substantial research since the development of
multitasking. For example, memory performance increases have not kept pace with CPU
performance increases. If a computation is memory intensive, its throughput will be limited by
memory performance, even if the CPU is clocked at its maximum frequency [2].
This work examines the efficiency implications of independently modulating core and L2
frequencies. We examine applications that fit in L2 cache and therefore effectively decouple
execution throughput from external delays, such as from memory.
We classify these apps whose memory footprint fits within L2 cache into three categories:
apps whose throughput is primarily limited by (1) L2 cache or (2) core, and (3) those whose
throughput is substantially limited by both.

2.3 DEVICE POWER CONSUMPTION
Recall that energy consumption is the integral of power consumption over the lifetime of a
computation. Power consumption is the sum of active and background components with differing
sensitivities to clock frequency.
CPUs and memory devices are composed of logic gates. Increasing throughput by
increasing clock frequency requires the gates to switch faster. Faster switching times require
increased voltage and current, and therefore power. Both active and inactive gates also draw a
leakage current that is also monotonic in voltage. The minimum voltage
required to support a given switching rate is approximately linear in that rate.
As described in Section 3.4, overall system power consumption is modeled using Walker’s
strategy. This strategy partitions power consumption into active and background (passive) components.
2.3.1 Active Power
Active power is power consumed by gates that are performing a computation. The amount
of energy consumed by a gate transition is quadratic with voltage [3]–[6]. Since the rate of
transitions is (approximately) proportional to frequency, the power they consume is cubic in
frequency. If all power consumption were active, computations would be most energy-efficient at
a frequency low enough to be computable at the device’s lowest voltage.
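
The cubic relationship can be sketched with the standard first-order CMOS switching model, assuming (as noted in Section 2.3) that the required voltage is approximately linear in frequency. The symbols $\alpha$ (activity factor), $C$ (switched capacitance), $k$ (voltage-to-frequency slope), and $N$ (number of transitions in a computation) are introduced here for illustration only.

```latex
\begin{align*}
E_{\text{transition}} &\propto C V^{2} \\
P_{\text{active}} &= \alpha C V^{2} f \\
V \approx k f \;\Longrightarrow\; P_{\text{active}} &\approx \alpha C k^{2} f^{3} \\
E_{\text{active}} = N\, C V^{2} &\approx N\, C k^{2} f^{2}
\end{align*}
```

Under these assumptions, the active energy of a fixed amount of work grows quadratically with frequency, which is why a purely active workload would favor the lowest usable voltage.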

2.3.2 Background Power
Background power is composed of static and background dynamic power. Static power is
a result of leakage current, which is an artifact of IC design. Leakage current is approximately
proportional to voltage (or frequency) [3]–[5].
Background dynamic power is from gate transitions that support computation but are not
on the data path. These include things such as processor management and instruction scheduling.
As with active power, background dynamic power is cubic in frequency.

2.5 ACTIVE POWER PROPORTION
The interaction of frequency and app energy consumption is nuanced. If background power
were insignificant and active power dominated, the highest efficiency would be at the lowest
voltages and therefore low frequencies. In contrast, if background power dominated, then a race
to completion strategy of operating at high frequency would be most efficient.
This research includes efficiency measurements collected from an aggressively energy-optimized reference system developed by Nvidia that utilizes their TX1 high performance CPU
[7]. We observe that, for this system, background power consumption dominates at low clock
frequencies, that active power dominates at high clock frequencies, and the clock frequency that
minimizes energy consumption varies by app.
Recall that dynamic power is cubic with frequency and static power is linear. Thus, the
proportion of active power increases with frequency. To facilitate analysis, we define the active
power proportion (APP) as the proportion of total system power draw that is classified as active.
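
As a rough illustration of why APP rises with frequency, the sketch below evaluates a toy power model in which active and background dynamic power are cubic in frequency and static power is linear. The coefficients `a`, `d`, and `s` are hypothetical placeholders, not values fitted to the TX1.

```python
def active_power_proportion(freq_ghz, a=1.0, d=0.5, s=2.0):
    """APP under a toy model: active = a*f^3, background = d*f^3 + s*f.

    The coefficients a, d, and s are hypothetical, not fitted values.
    """
    active = a * freq_ghz ** 3
    background = d * freq_ghz ** 3 + s * freq_ghz
    return active / (active + background)


# Evaluate APP over an illustrative range of coupled clock frequencies.
freqs_ghz = [0.5, 1.0, 1.5, 2.0]
app_by_freq = [active_power_proportion(f) for f in freqs_ghz]
```

With these assumed coefficients, APP increases monotonically across the listed frequencies, consistent with background power dominating at low frequency and active power at high frequency.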

2.6 POWER DOMAINS AND DECOUPLING
Throughput and/or energy consumption can be optimized by matching the performance
characteristics of major subsystems of the computing system to characteristics of the program
being executed. For example, application throughput has varying degrees of sensitivity to L2
cache and core frequency. A natural way to modulate the power and throughput of a synchronous
(sub)system is to modulate its clock frequency. (D)VFS power domains commonly encompass
entire cores and their associated caches, which are all driven by a common clock.
We hypothesize that partitioning the DVFS domain and adjusting the frequency of L2 and
core independently would improve efficiency by allowing a CPU to match an application’s needs.
Reducing the frequency (and therefore power consumption) of non-limiting components should
result in a greater reduction in power consumption than in throughput.

2.7 OTHER APPROACHES
Industry has employed several approaches to reduce the background power from underutilized functional units.
Power and Clock Gating: Circuitry that shuts down unused functional units. For example,
several Intel processors shut down portions of vector units or parts of caches when they are
unused [8]–[10]. This gating provides some adaptiveness to app characteristics. However, this
level of granularity may be too coarse to yield substantial benefit. For example, a lightly loaded
pipeline is active even though only a few of its stages are contributing to
throughput. Thus, energy consumed by inactive stages can contribute to both static and dynamic
background power consumption.
Gating has an interesting interaction with DVFS and Overprovisioning. Many CPUs are
overprovisioned in that their heat dissipation limits are exceeded if operated at higher clock
frequencies when all functional units are active. To prevent thermal overload and reduce power
consumption under light workloads, CPUs dynamically modulate their clock frequency. We
observe that this model optimizes throughput rather than efficiency.
Big-Little CPUs: Big-Little CPUs contain both energy-hungry (big) high performance
cores and more energy-efficient (little) lower performance cores that implement the same ISA.
These designs enable threads to be assigned to these cores based on performance and energy
constraints [11]–[13]. This achieves effects related to those proposed here, using a different mechanism
that is more expensive and complicated and provides more limited control.

Chapter 3: Methodology
3.1 OVERVIEW
A system of simulation and modeling tools was adapted to determine the relative
throughput, power consumption, and efficiency of kernels executed on systems with coupled and
decoupled clocks. The overall computational workflow, including the tools and the strategies used to tune
them, is illustrated in Figure 3.1 Methodology. As illustrated at the top of this figure, the five
process stages involved are:
1. Collection of data about the TX1 CPU needed to calibrate the simulation and modeling
tools.
2. Calibration of timing simulation and power modelling tools using data from Stage 1.
3. Extending the timing simulation and power modeling tools created in Stage 2 to support
decoupled clocks.
4. Timing simulation and power modeling of the same kernels executed in Stage 1 using
a variety of coupled and decoupled clock frequencies using the tools created in Stage 3
and event counts collected in Stage 1.
5. Computation of energy consumption and efficiency from power draw and timing data
generated in Stage 4.
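
Stage 5 can be sketched as follows, under the simplifying assumption that the power model yields an average power for each configuration, so that energy is the product of average power and execution time. The configuration tuples, power values, and times are hypothetical stand-ins for Stage 4 outputs, not data from this study.

```python
from typing import Dict, Tuple

Config = Tuple[int, int]  # (core MHz, L2 MHz); hypothetical notation


def energy_joules(avg_power_w: float, exec_time_s: float) -> float:
    """Energy as average power multiplied by execution time."""
    return avg_power_w * exec_time_s


def most_efficient(power_w: Dict[Config, float], time_s: Dict[Config, float]):
    """Return the (config, energy) pair with the lowest modeled energy."""
    energies = {cfg: energy_joules(power_w[cfg], time_s[cfg]) for cfg in power_w}
    best = min(energies, key=energies.get)
    return best, energies[best]


# Hypothetical Stage 4 outputs for one kernel; not measured data.
predicted_power_w = {(1000, 1000): 4.0, (1000, 500): 3.2, (1500, 500): 4.5}
predicted_time_s = {(1000, 1000): 5.0, (1000, 500): 5.5, (1500, 500): 4.2}

best_cfg, best_energy = most_efficient(predicted_power_w, predicted_time_s)
```

In this made-up example the decoupled (1000, 500) configuration draws less power than the coupled one while running only slightly longer, so it has the lowest energy.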

[Figure 3.1: workflow diagram with five stages: (1) data collection on TX1 with coupled CPU, (2) model construction, (3) model decoupling, (4) simulation and modelling of decoupled system, and (5) energy and efficiency computation. Labeled elements include published hardware specifications, kernels running on hardware, Gem5 hardware models, calibration, simulator modification, grouping events, hardware counts, execution time, power, the power model, decoupled timing simulation (Gem5), the decoupled power model (Walker), predicted time, predicted power, and energy.]

Figure 3.1 Methodology
An Nvidia TX1 development board is used as a power-efficient coupled-clock high
performance reference for this work.
This system is suitable because it incorporates a commodity A57 CPU [14] and LPDDR4
[15] memory found in many energy-sensitive high performance systems.
The TX1’s Arm A57 CPU implements features found in other high-performance processors, such as superscalar and out-of-order execution. The Arm A57 CPU is
commonly used in both energy constrained and compute constrained environments such as the
Nintendo Switch, Google Pixel C, and autonomous driving systems. The Arm A57’s caches and
cores are driven by a single coupled clock [16].
The TX1 system uses standard LPDDR4 memory. LPDDR4 is the low-power variant of
DDR4 [17], [18]. DDR4 is a memory interface commonly used in high performance systems. Its
successor (DDR5) was announced in 2020 and was not incorporated within systems until late 2021.
LPDDR4 memory is incorporated in a variety of energy-constrained performance-sensitive game
consoles, smartphones, and tablets [19].

3.2 BENCHMARK KERNELS
Recall that our objective is to determine the impact of decoupled core and L2 cache speeds
on energy efficiency. To isolate this phenomenon, apps from a variety of problem domains with
the following characteristics were chosen as benchmark kernels:
Execution time from 2 to 10 seconds on actual hardware: Shorter times would yield
unrepresentative data, while longer times would result in excessive simulation times.
Program’s working set fits within L2 cache: This work focuses on core and L2 limited
applications. Larger working set sizes would result in memory throughput and latency limited
applications, which are outside the scope of this work.
Ability to statically link without modification: This is motivated by multiple pragmatic
considerations. It is critical that the same code execute on the physical TX1 system and under
simulation. Dynamic linking can inadvertently import different variants of the same library.
Static linking eliminates this risk. Furthermore, dynamic linking is not among the system services
emulated by Gem5, and its direct simulation would dramatically increase the four to thirty hours
required to simulate benchmark execution.
The set of benchmarks selected appears in Table 3.1 Benchmark Kernels. The first column
indicates a characterization of execution behavior described in Section 4.1. Input
parameters and configurations are listed in Appendix A. Entries on the last six rows of this table
are microbenchmarks used only for power and energy calibration. In order to provide a wide range
of workloads, some of these benchmarks were executed with multiple parameter sets. The
parameter sets are described in Appendix A Benchmarks.


Table 3.1 Benchmark Kernels

3.3 HARDWARE EVENT MEASUREMENT
The power models are driven by event rates derived from hardware performance counters.
Calibration of the power model was performed using power, rate, and time data collected from
benchmark executions on the TX1 evaluation board. The Linux perf tool was used to collect
hardware counts and execution times. Power data was collected using a custom Arduino-based
measurement system.
Linux’s widely used perf instrumentation tool [20], [21] was used to collect execution time
and event counts for all benchmarks executing on a TX1 system at ten different frequencies. Since
there are more event types than hardware counters, multiple instrumentation runs counting
different subsets of the event types were required for full coverage. This process was completed
for each frequency within the range of frequencies used in this work.
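The need for multiple instrumentation runs can be illustrated with a short sketch. This is not the script used in this work: the function name, the example list of raw event codes, and the limit of six simultaneously programmable counters are assumptions chosen for illustration.

```python
from typing import List


def batch_events(events: List[str], counters_available: int) -> List[List[str]]:
    """Split event names into groups no larger than the number of hardware counters."""
    if counters_available < 1:
        raise ValueError("need at least one hardware counter")
    return [events[i:i + counters_available]
            for i in range(0, len(events), counters_available)]


# Hypothetical raw event codes; each batch would be counted in a separate run.
events = ["r8", "r11", "r19", "r43", "r50", "r52", "r71", "r74"]
for run, batch in enumerate(batch_events(events, counters_available=6), start=1):
    print(f"run {run}: perf stat -e {','.join(batch)} ./benchmark")
```

Each batch corresponds to one execution of the benchmark, and the whole set of batches is repeated at every frequency examined.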
Power data was gathered using a custom Arduino-based power measurement system. Power
measurements were taken between the power supply and the TX1 development board. Generating
an accurate power model requires data from kernels with a variety of instruction mixes running at
many different frequencies. Each kernel was executed across the entire range of frequencies
examined.
Event counts appear to be insensitive to (coupled) clock frequency. There was minimal
(less than 1%) variation among event counts for the same program executed across the range of
frequencies.
These counts are used to (1) calibrate the power model for coupled systems and (2) model
the power consumption of decoupled systems. This is reasonable because, in both cases, the same
program’s execution is being simulated and modeled. While execution timing may vary
significantly, the same set of operations is performed in (approximately) the same order.
The number and types of speculatively scheduled but discarded operations vary between
coupled and decoupled systems. These could not be included in our model, and neither the Gem5
simulator nor the available counters expose the events necessary to support this analysis.

3.4 POWER MODELLING
The power modeling method developed by Walker and utilized by others [22], [23] was
extended to support decoupled clocks. In this model, the energy consumption of each functional
unit operation is constant when clock rate and voltage are fixed. Variation in dynamic power draw
due to voltage is modeled quadratically. To extend Walker’s model to support modelling systems
with independent cache and core speeds, event counters were classified as relevant to either L2
cache or other CPU functions.


3.4.1 Walker’s Model
Walker models power consumption of a system with coupled clocks [23] (see Equation 3.1
Power Consumption Model). In this model, n indexes the N hardware event types, E_n is the rate at
which event n occurs (e.g., additions per second), β_n is the coefficient fitted for that event,
β_b is the coefficient of the background dynamic term, V_DD is the operating voltage, and f_clk is
the operating frequency. For notational convenience, we refer to V_DD² f_clk (abbreviated V²f) as
the system’s power factor. This model assumes that the power consumption of a gate transition
scales linearly with V²f and reflects electrical characteristics of CMOS circuitry.

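\[
P_{cluster} =
\underbrace{\left(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\right)}_{\text{dynamic activity}}
+ \underbrace{\beta_b V_{DD}^{2} f_{clk}}_{\text{BG dynamic}}
+ \underbrace{f(V_{DD})}_{\text{static}}
\]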
Equation 3.1 Power Consumption Model
In this model, when voltage and frequency are constant, a functional unit’s operation rate
is proportional to its power consumption. This strategy correlates measured power consumption
of various apps with their event counts at a range of clock frequencies. This correlation yields an
overall system power consumption model that is linear with operation rates collected from
performance counters incorporated within CPUs.
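As a minimal sketch, Equation 3.1 can be evaluated as follows. The function and argument names are illustrative assumptions, and the static term f(V_DD) is passed in as an already evaluated number because its functional form is not given here.

```python
from typing import Mapping


def cluster_power(rates: Mapping[str, float],
                  beta: Mapping[str, float],
                  beta_background: float,
                  static_power: float,
                  vdd: float,
                  f_clk: float) -> float:
    """Evaluate the coupled-clock model: dynamic activity + BG dynamic + static."""
    power_factor = vdd ** 2 * f_clk  # the V^2 f term shared by all dynamic components
    dynamic_activity = sum(beta[e] * rates[e] for e in beta) * power_factor
    background_dynamic = beta_background * power_factor
    return dynamic_activity + background_dynamic + static_power
```

With voltage and frequency held constant, the result is linear in each event rate, which is the property the regression relies on.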

3.4.2 Event Selection
The power model was derived from executions of 15 kernels that exercise a variety of
subsystems (see Table 3.1 Benchmark Kernels). The TX1 supports over 50 different hardware
event counters [24] that can be used as parameters for the model.
The power draw of each of the fifteen applications was measured at ten frequencies,
yielding a total of 150 observations. If all fifty counters were included, there would be only three
observations per parameter. Harrell and others observe that a minimum of ten observations per
parameter is needed to avoid overfitting [25], [26].
Many of these counters measure overlapping events. These counters would correlate with
one another resulting in an unstable model.
Walker [22], and later Reddy [23], constructed models that incorporate only five to seven
counters. We implement their greedy strategy, which progressively grows the set of selected event
counters so that each addition improves the model’s fit with measured power consumption. This
algorithm terminates when the addition of any unselected counter causes overfitting, as indicated
by an adjusted R² metric.
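The greedy strategy can be sketched as a forward selection loop. This is an illustration under assumptions, not the implementation used in this work: it fits ordinary least squares with NumPy and stops as soon as no remaining counter raises adjusted R², which is one simple reading of the termination rule. All names are hypothetical.

```python
import numpy as np


def adjusted_r2(X: np.ndarray, y: np.ndarray) -> float:
    """Adjusted R^2 of an ordinary least-squares fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def greedy_select(X: np.ndarray, y: np.ndarray, names: list):
    """Add one counter at a time while adjusted R^2 keeps improving."""
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score every candidate counter when added to the current selection.
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best:  # no candidate improves the fit: stop
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return [names[j] for j in selected], best
```

Here each row of X would hold the event rates of one (application, frequency) observation and y the measured power for that observation.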

3.4.3 Modifications to Support Decoupling
All functional units in a system with a coupled clock are driven synchronously. Since the
power factor (see Section 3.4 Power Modelling) is derived from frequency, in a coupled system
the same power factor applies to all of its functional units.
In contrast, in a decoupled system, (groups of) functional units operate at different
frequencies and therefore have differing power factors. To model total device power draw when
L2 and core are clocked at differing frequencies, these power factors must be independently
associated with the active and passive consumption of each of these subsystems.
As tabulated in Table 3.2, event types selected using Walker’s algorithm are segregated by
relevance to the independently clocked L2 and core subsystems. Rates of each group are then
scaled by a power factor computed from that group’s frequency.
Static power is dependent on the area a component occupies. To calculate the static power
use of a modified system, the static power contributions of the unmodified CPU’s components are
combined in proportion to their area.
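The per-group scaling of event rates can be sketched as follows. Only the dynamic activity term is shown; how the background and static terms are apportioned is not reproduced here. The function names and the dictionary-based interface are assumptions made for illustration.

```python
from typing import Mapping


def power_factor(vdd: float, f_clk: float) -> float:
    """V^2 f for one independently clocked subsystem."""
    return vdd ** 2 * f_clk


def decoupled_dynamic_power(rates: Mapping[str, float],
                            beta: Mapping[str, float],
                            event_class: Mapping[str, str],
                            factor_by_class: Mapping[str, float]) -> float:
    """Sum beta * rate over events, scaling each by its own group's power factor."""
    total = 0.0
    for event, coeff in beta.items():
        total += coeff * rates[event] * factor_by_class[event_class[event]]
    return total
```

When both classes are given the same power factor, the expression reduces to the dynamic activity term of the coupled model.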

Event types utilized in our model, their classification, and their coefficients are tabulated
in Table 3.2 Power Model Events and Coefficients.

Table 3.2 Power Model Events and Coefficients
Description                               Event       Class        Coeff
Static                                    Intercept   Background   2.49
Cycles                                    r11         Background   1.36E-19
L2 read                                   r50         L2           1.64E-18
L1 refills due to write                   r43         Non-L2       5.93E-17
SIMD instruction speculatively executed   r74         Non-L2       1.99E-19
Store instruction speculatively executed  r71         Non-L2       -2.87E-19
Bus accesses not including L2 refill      r19-r52     Non-L2       3.15E-18
Instructions executed                     r8          Non-L2       1.51E-19

3.4.4 Validation Study
Our coupled power model was validated using the technique developed by Walker, applying
standard statistical measures and comparing predicted to measured values. As listed in
Table 3.3 Power Model Validation Results, the adjusted R² and mean absolute percentage error
(MAPE) were slightly worse than those of published models, while collinearity, as measured using
the variance inflation factor (VIF), was slightly better.
Table 3.3 Power Model Validation Results
Model        Adj R^2   MAPE   VIF
Our Model    0.92      4.55   2.8
Walker A7    0.99      2.81   3.04
Walker A15   0.99      3.79   4.94
Reddy        0.98      5.9    2.9
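For reference, two of the metrics in Table 3.3 can be computed as sketched below. These are the textbook definitions of MAPE and VIF written with NumPy. Whether the tabulated VIF is a mean or a maximum over the selected counters is not stated, so the sketch returns one VIF per column.

```python
import numpy as np


def mape(measured, predicted) -> float:
    """Mean absolute percentage error, in percent."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((measured - predicted) / measured)) * 100.0)


def vif(X) -> list:
    """Variance inflation factor of each column: 1 / (1 - R^2) of that column
    regressed on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / ss_tot
        factors.append(float(1.0 / (1.0 - r2)))
    return factors
```

A VIF of 1 means a counter is uncorrelated with the others; larger values indicate overlapping counters of the kind that destabilize the model.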

3.5 GEM5
Predictions for execution time are generated by the flexible and extensible Gem5
full-system simulator [27]. Gem5 is widely used for architectural research (the ACM digital library
contains over 1000 references to Gem5). Validated Gem5 parameter sets have been published that
enable it to closely model timing of commodity CPUs including the ARM family investigated in
this research [28], [29]. To enable this research, features were added to Gem5 that enable
simulation of systems with independent cache and core speeds.
Recall that the Nvidia TX1 system consists of many subsystems including memory, CPU
cores, and caches. Gem5 has modules to simulate these components. Our TX1-A parameter set
was synthesized from Gem5’s existing Exynos 5 (based on Arm A15) and LPDDR3 models by
adjusting their parameters to correspond to published specifications of the A57 CPU and of the
LPDDR4 memory components in the TX1 [15], [24], [30]. Table 3.4 and Table 3.5 summarize the
differences between the CPU and memory models respectively.

Table 3.4 Gem5 CPU Model Differences
Parameter class        Parameter                                 A15 value    A57 value
                                                                 (existing)   (new)
Instruction latency    Integer Multiply                          4            3
                       Floating point add                        6            5
                       Floating point square root                33           12
                       Floating point multiply                   8            5
                       SIMD Compare                              4            3
                       SIMD Conversion                           3            5
                       SIMD Other                                3            4
                       SIMD Multiply                             6            4
                       SIMD Shift Accumulate                     3            4
                       SIMD Square root                          9            5
                       SIMD Floating point add                   6            5
                       SIMD Floating point comparison            3            5
                       SIMD Floating point conversion            3            5
                       SIMD Floating point other                 3            5
                       SIMD Floating point multiply              6            5
                       SIMD Floating point multiply accumulate   1            9
                       SIMD Floating point square root           9            5
                       Load                                      2            1
                       Store                                     2            1
Internal structures    dispatch width                            6            3
                       writeback width                           8            16
Cache sizes            L1 instruction                            32KiB        48KiB
                       TLB Cache                                 1KiB         2KiB
Cache specifications   L1 Instruction latency                    2            1
                       TLB Associativity                         8            4

Table 3.5 Gem5 Memory Model Differences
Parameter class      Parameter        LPDDR3       LPDDR4
                                      (existing)   (new)
Operation Latency    tCK              0.625ns      1.25ns
                     tCL              15ns         17.5ns
                     tWR              18ns         15ns
                     tBurst           5ns          2.5ns
                     tWTR             7.5ns        10ns
                     tCS              2.5ns        1.25ns
                     tFAW             50ns         40ns
                     tXS              140ns        137.5ns
Configuration        module size      512MiB       256MiB
                     row buffer       4kB          2kB
                     banks per rank   8            16
                     channels         -            2

Gem5 Validation
Validation of the Gem5 models was performed by comparing the execution times and
execution counts of simulated execution against those measured on TX1 hardware.
When cache speed is coupled to core frequency, cache throughput is proportional to clock
frequency. Measurements on the TX1 testbed confirmed this behavior, as shown in Table 3.6. The
Gem5 models demonstrated this characteristic for sequential memory accesses, but not for random
memory accesses.

Table 3.6 L2 Cache Performance Comparison, TX1-A model vs Actual
Access pattern      Metric              1034 MHz   1730 MHz
Random access       Simulated (MiB/s)   588        636
                    Actual (MiB/s)      384        641
                    Error               53%        -1%
Sequential access   Simulated (MiB/s)   8822       14784
                    Actual (MiB/s)      8703       14409
                    Error               1%         3%
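The Error rows of Table 3.6 are consistent with a signed relative error, (simulated − actual) / actual. That definition is inferred from the tabulated values rather than stated in the text, and the short sketch below simply recomputes the four entries under that assumption.

```python
def percent_error(simulated: float, actual: float) -> float:
    """Signed relative error of the simulated value, in percent."""
    return (simulated - actual) / actual * 100.0


# (simulated MiB/s, actual MiB/s) as tabulated in Table 3.6
table_3_6 = {
    ("random", 1034): (588, 384),
    ("random", 1730): (636, 641),
    ("sequential", 1034): (8822, 8703),
    ("sequential", 1730): (14784, 14409),
}

for (pattern, mhz), (sim, act) in table_3_6.items():
    print(f"{pattern:10s} {mhz} MHz: {percent_error(sim, act):+.0f}%")
```

Only the random-access entry at 1034 MHz deviates substantially, which is what motivates the frequency-dependent correction.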

This error is frequency dependent. It is therefore likely a result of interactions with some
component whose timing is not derived from the common clock.
All exposed model parameters related to cache operation and interaction reference the
common clock and therefore would proportionally affect cache performance at all clock
frequencies.
To compensate for this frequency-dependent error, a single parameter, the number of miss
status holding registers (MSHRs), which limits the number of concurrently outstanding requests,
was varied with frequency. The values used are tabulated in Table 3.7 and the results in Table 3.8.
Table 3.7 Frequency Dependent MSHR Table
Frequency (MHz)   MSHRs
710               6
826               6
922               6
1037              6
1133              6
1224              7
1326              7
1428              8
1556              8
1632              8
1734              8
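Table 3.7 amounts to a lookup from simulated clock frequency to MSHR count. A minimal sketch of such a lookup follows. The names are hypothetical, and how the value is handed to the simulator configuration is not shown.

```python
# MSHR count per simulated frequency (MHz), transcribed from Table 3.7.
MSHRS_BY_MHZ = {
    710: 6, 826: 6, 922: 6, 1037: 6, 1133: 6,
    1224: 7, 1326: 7,
    1428: 8, 1556: 8, 1632: 8, 1734: 8,
}


def mshrs_for(freq_mhz: int) -> int:
    """Return the MSHR count for one of the tabulated frequencies."""
    try:
        return MSHRS_BY_MHZ[freq_mhz]
    except KeyError:
        raise ValueError(f"no MSHR setting tabulated for {freq_mhz} MHz") from None
```

Rejecting untabulated frequencies, rather than interpolating, keeps the sketch limited to the configurations that were actually validated.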

Table 3.8 L2 Cache Performance Comparison, TX1-Final model vs Actual
Access pattern      Metric              1034 MHz   1730 MHz
Random access       Simulated (MiB/s)   400        636
                    Actual (MiB/s)      384        641
                    Error               4%         -1%
Sequential access   Simulated (MiB/s)   8822       14784
                    Actual (MiB/s)      8703       14409
                    Error               1%         3%

When validated across all applications, the maximum observed error is 27% and the median
error in our validation experiments is 8%, which compares favorably with the 20% reported for
Gem5 configurations utilized in similar architectural studies [29], [31].

Gem5 Modifications to Support Decoupling
Gem5’s simulation is built around caches that are tightly integrated with cores. Within
Gem5’s parameter sets, cache timings are specified in terms of core clock cycles. Therefore, to
implement a cache with an independent speed, these timings need to be scaled to the appropriate
values for a given cache and core speed combination. Gem5’s L2 cache instantiation code was
modified to calculate and set the timings based on the L2 clock frequency being simulated.
Appendix B Gem5 Modifications describes the changes to this instantiation code.
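The kind of scaling involved can be sketched as follows. This is not the code of Appendix B. It assumes that an L2 latency is fixed in L2 clock cycles and must be re-expressed in core clock cycles, and it rounds up to a whole core cycle. The function name and the rounding choice are assumptions.

```python
def l2_latency_in_core_cycles(latency_l2_cycles: int, core_mhz: int, l2_mhz: int) -> int:
    """Convert a latency counted in L2 cycles into (whole) core cycles."""
    if min(latency_l2_cycles, core_mhz, l2_mhz) <= 0:
        raise ValueError("latency and frequencies must be positive")
    # Integer ceiling of latency * core / l2, avoiding floating-point rounding.
    return -(-latency_l2_cycles * core_mhz // l2_mhz)
```

For example, a 12-cycle L2 latency costs 24 core cycles when the core runs at twice the L2 frequency, and 12 core cycles when the clocks are equal.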


Chapter 4 Experimental Results and Analysis
4.1 INTRODUCTION
For notational convenience, we normalize each configuration’s energy consumption by that
of the most efficient configuration. Thus, the most efficient configuration has a relative
energy metric (REM) of 1.0, and a configuration with a REM of 1.2 consumes 20% more
energy executing the same app.
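The normalization can be sketched in a few lines. The energy values used in the usage note are made up for illustration and are not measurements from this work. Configurations are keyed by hypothetical (L2, core) labels.

```python
from typing import Dict, Hashable


def relative_energy(energy_joules: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Divide every configuration's energy by the minimum energy observed."""
    best = min(energy_joules.values())
    return {config: energy / best for config, energy in energy_joules.items()}
```

For instance, `relative_energy({(9, 9): 10.0, (15, 9): 8.0, (17, 17): 9.6})` assigns a REM of 1.0 to `(15, 9)`, 1.25 to `(9, 9)`, and 1.2 to `(17, 17)`.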
The energy contour plots of Figures 4.3, 4.4, 4.7, 4.8, 4.11, and 4.12 indicate the relative
energy consumed by an execution of each app over the full range of examined core and L2
frequencies. The most efficient coupled configurations are indicated by a filled circle on the
diagonal, which is denoted by a dotted line. When a decoupled configuration is more efficient, it is
indicated by an “x.” Isopleth lines denote 3% efficiency reductions from the most efficient
configuration. Quiver arrows indicate gradient direction and amplitude.

Table 4.1 Most Efficient Configuration and Energy Summary

1 core
                                   Active Power (APP)                      Best Efficiency
Type           App                 Most Eff   Min Freq   Max Freq   EER    Coupled   Decoupled
Core limited   CoMD Small          16%        3%         17%        1.01   15        12,17
               CoMD Large          14%        3%         18%        1.01   15        12,15
               ATLAS Small         31%        9%         38%        1.00   15        15,15
               ATLAS Med           25%        9%         37%        1.00   13        13,13
               ATLAS Large         29%        8%         35%        1.00   15        15,15
L2 limited     HPCG                16%        7%         30%        1.06   15        17,10
               PageRank Kro Sm     21%        8%         31%        1.04   12        15,12
               PageRank Kro Lg     14%        8%         28%        1.04   12        15,9
               PageRank Uniform    18%        9%         31%        1.03   12        15,10
               SSSP Uniform        16%        10%        33%        1.06   12        16,9
               SSSP Kronecker      16%        8%         29%        1.06   12        17,10
               Himeno Small        25%        6%         27%        1.08   15        17,15
               Himeno Extra Small  22%        7%         29%        1.10   15        15,13
Balanced       BFS Uniform         18%        9%         33%        1.05   12        17,10
               BFS Kro             19%        9%         36%        1.03   12        17,10

4 core
                                   Active Power (APP)                      Best Efficiency
Type           App                 Most Eff   Min Freq   Max Freq   EER    Coupled   Decoupled
Core limited   CoMD Small          21%        15%        35%        1.00   11        11,11
               CoMD Large          21%        14%        35%        1.00   11        11,11
               ATLAS Small         36%        36%        59%        1.00   9         9,9
               ATLAS Med           33%        33%        60%        1.00   9         9,9
               ATLAS Large         30%        30%        57%        1.00   9         9,9
L2 limited     HPCG                29%        23%        46%        1.08   9         15,9
               PageRank Kro Sm     30%        22%        43%        1.10   9         15,9
               PageRank Kro Lg     27%        21%        41%        1.12   9         15,9
               PageRank Uniform    28%        21%        41%        1.11   9         15,9
               SSSP Uniform        33%        28%        49%        1.08   9         15,9
               SSSP Kronecker      30%        24%        46%        1.08   9         15,9
               Himeno Small        26%        20%        41%        1.11   9         15,9
               Himeno Extra Small  24%        20%        43%        1.07   9         15,9
Balanced       BFS Uniform         31%        27%        48%        1.04   9         15,9
               BFS Kro             31%        28%        51%        1.01   9         13,9

Efficiency Summary. Table 4.1 Most Efficient Configuration and Energy Summary
indicates the most efficient coupled and decoupled frequency pairing for each app when executed
mono-programmed on a single core and multi-programmed on four cores.


The range of clock frequencies examined extends over approximately a factor of two. For
notational convenience, these frequencies are represented as low-valued integers (see Table 4.2
Frequency Notation). Decoupled configurations are represented by pairs of integers. Frequencies
at the minimum or maximum limit are typeset in bold font. For decoupled configurations, the first
integer indicates L2 frequency and the second integer represents core frequency. Observe that when
decoupling provides no efficiency advantage, the coupled EER metric equals 1.0.

Table 4.2 Frequency Notation
Notation   Frequency (MHz)
8          826
9          922
10         1037
11         1133
12         1224
13         1326
14         1428
15         1556
16         1632
17         1734
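The notation of Table 4.2 can be decoded mechanically. In this sketch the function name and the string format are assumptions. A single integer is read as a coupled configuration and a pair as L2 frequency followed by core frequency, following the convention described above.

```python
from typing import Tuple

# Notation -> frequency in MHz, transcribed from Table 4.2.
NOTATION_TO_MHZ = {
    8: 826, 9: 922, 10: 1037, 11: 1133, 12: 1224,
    13: 1326, 14: 1428, 15: 1556, 16: 1632, 17: 1734,
}


def decode_config(text: str) -> Tuple[int, int]:
    """Return (L2 MHz, core MHz) for entries such as "15" or "17,10"."""
    parts = [int(p) for p in text.split(",")]
    if len(parts) == 1:  # coupled: both clocks share one frequency
        parts = parts * 2
    if len(parts) != 2:
        raise ValueError(f"expected one or two integers, got {text!r}")
    l2, core = parts
    return NOTATION_TO_MHZ[l2], NOTATION_TO_MHZ[core]
```

For example, `decode_config("17,10")` returns `(1734, 1037)`, that is, L2 at 1734 MHz with the core at 1037 MHz.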
All apps consume the least energy when the active power proportion (APP)
is between 15 and 30 percent. APP varies among apps, is higher when the device is
multi-programmed, and increases with frequency.
Generally, apps with lower APP are most efficient at higher (but not the highest)
frequencies. Conversely, apps with higher APP are most efficient at lower (but not the lowest)
frequencies.
Decoupling improved efficiency by up to 9% for L2-limited and balanced apps. Most
core-limited apps are most efficient when executed with coupled clocks. One core-limited app was
slightly (1-2%) more efficient with decoupled clocks, and only when executed on a single core.

L2-limited and balanced apps are most efficient when L2 is clocked at 75% or more of the
maximum frequency available. This is because their throughput is sensitive to L2 delays, and L2
power consumption increases slowly with clock frequency. L2 power consumption never exceeds
11% of total, even when L2 is clocked at the maximum frequency.
Core-limited apps are most efficient at low coupled speeds in multicore configurations due
to the amortization of system energy consumption over multiple cores and a high APP. In
single-core configurations, the best efficiency is at about 80% of the maximum core frequency.
This section also references bottleneck identification studies that identify apps whose
execution speed is principally limited by either L2 or core. In these studies, execution time
on a coupled system with a low (800 MHz) clock frequency is compared with execution time on
systems where either or both clock frequencies are approximately doubled to 1730 MHz. The
throughput of apps classified as L2- or core-limited is highly sensitive to the frequency of only
one of these clocks. Those whose throughput is similarly sensitive to both are classified as
(approximately) balanced.
Recall that all these apps are CPU bound in that, when clocks are coupled, throughput is
close to proportional to clock frequency, as shown in Table 4.3 Benchmark Speedup.
To determine the sensitivity of each app to L2 and core clock frequencies, we examine the
reduction of throughput when either L2 or core is clocked at minimum frequency with the other
set to maximum frequency.


Table 4.3 Benchmark Speedup
Type           App                  Speedup   Diff from ideal
Core Limited   CoMD small           2.03      -3%
               CoMD large           2.04      -3%
               ATLAS Small          2.03      -3%
               ATLAS Medium         2.05      -3%
               ATLAS Large          2.01      -4%
L2 Limited     HPCG                 1.83      -13%
               PageRank Kro Small   1.82      -13%
               PageRank Kro Large   1.98      -6%
               PageRank Uniform     1.80      -14%
               SSSP Uniform         1.79      -15%
               SSSP Kro             1.81      -14%
               Himeno Small         1.90      -10%
               Himeno Extra Small   1.90      -10%
Balanced       BFS Kro              1.94      -7%
               BFS Uniform          1.86      -11%

4.2 L2 LIMITED STUDY
As indicated in Figure 4.1 Power and Throughput of L2 Applications, the throughput of
L2-limited apps is 50% more sensitive to L2 frequency than to core frequency. As anticipated by
the hypothesis motivating this research, the throughput-limiting L2 cache is clocked at higher
frequencies than the core at the most efficient configuration.
As illustrated in Figures 4.3 and 4.4 (the L2 energy contour plots for unicore and multicore
execution), all the L2-limited apps are 4%-8% more efficient in decoupled configurations with
higher L2 than core frequency. In all configurations, L2’s active energy consumption is lower than
that of the core, and, for the most efficient configuration, APP varies from 15 to 20 percent (see
Figure 4.2 Energy Breakdown of L2 Applications).
Observe that the most efficient decoupled configurations’ core frequency is at the
minimum 700 MHz setting for these L2-limited apps when executed with multiprogramming.

[Figure panels, each shown for Unicore and Multicore: HPCG, PageRank Kr Sm, PageRank Kr Lg,
PageRank Uni, SSSP Kro, SSSP Uniform, Himeno Sm, Himeno XSmall. Configurations compared:
Most Eff, Most Eff C, Fast Coup, Fast Core, Fast Cache, Slow Coup.]
Figure 4.1 Power and Throughput of L2 Applications


[Figure panels (L2 apps), each shown for Unicore and Multicore: HPCG, PageRank Kr Lg,
PageRank Uni, SSSP Kro, SSSP Uniform, Himeno Sm, Himeno Lg, PageRank Kr Sm. Configurations
compared: Most Eff, Most Eff C, Fast Coup, Fast Core, Fast Cache, Slow Coup.]
Figure 4.2 Energy Breakdown of L2 Applications


[Figure panels: HPCG, Himeno Small, Himeno Extra Small, PR Kronecker Small, PR Kronecker Large,
PR Uniform, SSSP Kronecker, SSSP Uniform.]
Figure 4.3 L2 Energy Contour Plot L2 Applications Unicore


[Figure panels: HPCG, Himeno Small, Himeno Extra Small, PR Kronecker Small, PR Kronecker Large,
PR Uniform, SSSP Kronecker, SSSP Uniform.]
Figure 4.4 L2 Energy Contour Plot L2 Applications Multicore
4.3 CORE LIMITED STUDY
As illustrated in Figure 4.5 Power and Throughput of Core Applications, throughput of
core-limited apps is at least seven times more sensitive to core frequency than to L2 frequency.
Approximately doubling core frequency alone yields ~55-100% increases in throughput for
core-limited apps. In contrast, a configuration that approximately doubles only L2 frequency
yields only two to eight percent higher throughput.
As illustrated in Figure 4.6 Energy Breakdown of Core Applications, core-limited apps are
most efficient with higher core than L2 frequencies when executed on only one core. This is
consistent with our hypothesis that efficiency can be increased by speeding up throughput-limiting
subsystems. When multiprogrammed on four cores (see Figure 4.8 Energy Contour Plot Core
Applications Multicore), core-limited apps are most efficient at low coupled frequencies. We
observe that configurations with higher core speeds would have APP outside of the 15-30 percent
range that appears to correspond to high efficiency. This may be due to DVFS limitations: these
devices, whose L2 caches and cores are driven by the same clock, do not reduce voltage for
frequencies below 700 MHz.
For all core-limited apps examined, decoupling L2 and core frequency yields no more than 1%
improvement in efficiency. When running on one core, core frequencies marginally faster than L2
frequencies provide improved efficiency due to the high proportion of static power. When running
multicore, L2 static power is amortized among four times as many cores, which eliminates the
benefits of decoupling.


[Figure panels, each shown for Unicore and Multicore: CoMD Small, CoMD Large, ATLAS Small,
ATLAS Med, ATLAS Large. Configurations compared: Most Eff, Most Eff C, Fast Coup, Fast Core,
Fast Cache, Slow Coup.]
Figure 4.5 Power and Throughput of Core Applications

[Figure panels, each shown for Unicore and Multicore: CoMD Small, CoMD Large, ATLAS Small,
ATLAS Med, ATLAS Large. Configurations compared: Most Eff, Most Eff C, Fast Coup, Fast Core,
Fast Cache, Slow Coup.]
Figure 4.6 Energy Breakdown of Core Applications

[Figure panels: ATLAS Small, ATLAS Medium, CoMD Small, CoMD Large, ATLAS Large.]
Figure 4.7 Energy Contour Plot Core Applications Unicore

[Figure panels: ATLAS Small, ATLAS Medium, CoMD Small, CoMD Large, ATLAS Large.]
Figure 4.8 Energy Contour Plot Core Applications Multicore

4.4 BALANCED STUDY
As illustrated in Figure 4.9 Power and Throughput of Balanced Applications, balanced
apps have similar throughput sensitivity to cache and core frequencies. When cache and core
speeds are decoupled, balanced apps are 1-5% more efficient in both unicore and multicore
configurations (see Figure 4.11 Energy Contour Plot Balanced Applications Unicore and Figure
4.12 Energy Contour Plot Balanced Applications Multicore).
The most efficient decoupled single-core configurations consume one to five percent less
energy. When multiprocessing these apps, decoupling yields smaller (one to three percent) energy
savings.
APP for all the balanced apps at the most efficient coupled frequency is approximately
18% for unicore configurations. When decoupled, APP increases by 1-2%. For multicore operation,
APP increases to a minimum of 25% for coupled and 33% for decoupled configurations. This
occurs at the lowest examined frequency. This suggests that efficiency might be improved if
lower frequencies with lower voltages were available.
Throughput sensitivity to doubled L2 frequency is higher when multitasking (see Figure 4.9
Power and Throughput of Balanced Applications). When only cache speed is doubled, throughput is
within 5% of that obtained when only core speed is doubled. When running multicore, balanced apps
become more sensitive to increases in cache speed, since cache resources are shared across apps.
However, throughput increases due to cache frequency are within 15% of throughput increases due
to core frequency.
Active power increases significantly when core frequency is increased. Because throughput
sensitivity to cache and core frequency is similar, the best efficiency is achieved with cache
clocked at higher frequencies than core. When running multicore, active core power dominates at
higher frequencies. The best efficiency is at a lower overall frequency, with cache clocked at a
~50% higher frequency than core.

[Figure panels: BFS Uniform 1 core, BFS Uniform 4 core, BFS Kronecker 1 core, BFS Kronecker
4 core. Configurations compared: Most Eff, Most Eff C, Fast Coup, Fast Core, Fast Cache,
Slow Coup.]
Figure 4.9 Power and Throughput of Balanced Applications
[Figure. Configurations compared: Most Eff, Most Eff C, Fast Coup, Fast Core, Fast Cache,
Slow Coup.]
Figure 4.10 Energy Breakdown of Balanced Applications

[Figure panels: BFS Kronecker, BFS Uniform.]
Figure 4.11 Energy Contour Plot Balanced Applications Unicore

[Figure panels: BFS Kronecker, BFS Uniform.]
Figure 4.12 Energy Contour Plot Balanced Applications Multicore
4.5 SYNOPSIS AND POTENTIAL EXTENSIONS OF THIS WORK
The purpose of this work is to examine the potential efficiency benefits of decoupling the
speeds of various CPU systems. This allows a CPU to balance subsystem performance and power
consumption characteristics to match the running application. This work is a limited case study
that examines the decoupling of cache and core speeds.
As anticipated by our motivating hypothesis, decoupling of L2 cache and core speeds
provided up to 10% efficiency improvements for L2-limited applications and up to a 2%
improvement for core-limited applications. In systems where clock frequency is limited by a power
cap, the increased efficiency provided by decoupling can potentially enable increased throughput
versus a coupled configuration. We observe that the efficiency gain provided by decoupling is
reduced when the apps are executed on multiple cores.

4.5.1 Open Questions
Below are several potential extensions to this research.
4.5.1.1 Slower clock frequencies
Many of the most efficient configurations are at the lowest frequency in the
frequency-voltage scaling range of the TX1 CPU. Core active power consumption dominated L2 power
consumption in these instances. This suggests that alternative implementations that extend the
CPU core’s frequency range to lower frequencies might yield even greater efficiency.

4.5.1.2 Decoupling Additional Subsystems
CPU cores contain many synchronously clocked subsystems. Decoupling clocks or other
throughput scaling of those other systems may enable further tuning of CPU efficiency to match
app characteristics.

4.5.1.3 Dynamic Optimization
We observe that system efficiency as a function of L2 and core frequency is a convex
surface. This suggests that systems could potentially employ hill-climbing strategies to
autonomically optimize those parameters while monitoring execution throughput already exposed
through event counters. Furthermore, this autonomic control could potentially be implemented
within an operating system, or perhaps, like DVFS, within a CPU.
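A hill-climbing strategy of this kind can be sketched as follows. This is a speculative illustration, not part of the present work: the energy function passed in stands for whatever measurement a real controller would take, and the example surface in the usage note is synthetic. Frequencies are expressed in the integer notation of Table 4.2.

```python
from typing import Callable, Tuple

Config = Tuple[int, int]  # (L2 notation, core notation)


def hill_climb(energy: Callable[[int, int], float],
               start: Config, lo: int, hi: int) -> Config:
    """Step to the best neighbouring (L2, core) setting until none is better."""
    current = start
    current_energy = energy(*current)
    while True:
        l2, core = current
        # Neighbours differ by one frequency step in exactly one clock domain.
        neighbours = [(l2 + dl2, core + dcore)
                      for dl2, dcore in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if lo <= l2 + dl2 <= hi and lo <= core + dcore <= hi]
        best = min(neighbours, key=lambda c: energy(*c))
        best_energy = energy(*best)
        if best_energy >= current_energy:
            return current  # local minimum; global if the surface is convex
        current, current_energy = best, best_energy
```

On a synthetic convex surface such as `lambda l2, core: (l2 - 15) ** 2 + (core - 9) ** 2 + 1`, starting from `(12, 12)` with bounds 8 and 17, the search ends at `(15, 9)`. On a real system each evaluation would cost a measurement interval, so the number of steps matters.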


4.5.1.4 Thread Interactions
Concurrent threads of a parallel program executing on multiple cores communicate and
coordinate via memory (and caches). These interactions may substantially impact energy
consumption and throughput, and therefore the energy required to complete a computation. Thread
interactions were not examined in the present research due to limitations of available simulation
tools and libraries.

4.5.1.5 Speculative Execution
Observe that some apps have higher throughput when L2 speed is slower and core speed
is held constant. This paradoxical phenomenon likely is due to the differences in memory latency
triggering different sets and sequences of speculatively issued operations, such as memory fetches.
For example, speculative memory references can increase contention for shared communication
channels or evict cache entries. The confounding influence of speculative execution has been
observed by others [prefetch-pref][prefetch-work]. Further examination of speculative execution
and decoupling may reveal potential optimization strategies.


References
[1] “About Fugaku | RIKEN Center for Computational Science RIKEN Website.” https://www.r-ccs.riken.jp/en/fugaku/about/ (accessed Jul. 12, 2022).
[2] Z. Qureshi, V. S. Mailthody, S. W. Min, I.-H. Chung, J. Xiong, and W. Hwu, “Tearing Down the Memory Wall,” Aug. 2020, doi: 10.48550/arxiv.2008.10169.
[3] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2008.
[4] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, Apr. 1992, doi: 10.1109/4.126534.
[5] A. Pallipadi and A. Starikovskiy, “The ondemand governor: past, present and future,” Proceedings of Linux Symposium Volume Two, pp. 215–230, 2006, Accessed: Nov. 23, 2013. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:The+ondemand+governor:+Past,+present,+and+future#1
[6] “Zombie Kid Likes Turtles - YouTube.” https://www.youtube.com/watch?v=CMNry4PE93Y (accessed Jul. 14, 2022).
[7] “NVIDIA® Jetson™ TX1 Supercomputer-on-Module Drives Next Wave of Autonomous Machines | NVIDIA Technical Blog.” https://developer.nvidia.com/blog/nvidia-jetson-tx1-supercomputer-on-module-drives-next-wave-of-autonomous-machines/ (accessed Jul. 13, 2022).
[8] M. H. Taufique, A. Okpisz, H. N. Ahmed, J. R. Riley, M. M. Hasan, and G. Gerosa, “A 512-KB level-2 cache design in 45-nm for low power IA processor silverthorne,” in 2008 IEEE Custom Integrated Circuits Conference, Sep. 2008, no. Cicc, pp. 403–406. doi: 10.1109/CICC.2008.4672105.
[9] J. McCalin, “AVX2 optimized code execution time deviation,” 2016. https://software.intel.com/en-us/forums/intel-isa-extensions/topic/703317 (accessed Dec. 07, 2017).
[10] A. Fog, “Agner`s CPU blog - Test results for Broadwell and Skylake,” 2015. http://www.agner.org/optimize/blog/read.php?i=415 (accessed Dec. 07, 2017).
[11] ARM, “big.LITTLE Technology: The Future of Mobile,” 2013. Accessed: Jul. 06, 2022. [Online]. Available: https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Futue_of_Mobile.pdf
[12] H. Chung, M. Kang, and H.-D. Cho, “Heterogeneous multi-processing solution of Exynos 5 Octa with ARM big.LITTLE technology,” Samsung White Paper, 2012.
[13] “Game Dev Guide for 12th Gen Intel Core Processor Hybrid Architecture,” 2022. https://www.intel.com/content/www/us/en/developer/articles/guide/12th-gen-intel-core-processor-gamedev-guide.html (accessed Jul. 06, 2022).
[14] Nvidia, “DATA SHEET NVIDIA Tegra X1 Series Processors Maxwell GPU + ARM v8 Description The NVIDIA ® Tegra ® X1 series SoC couples the latest NVIDIA Maxwell,” 2014.

[15] M. Technology, “Mobile LPDDR4 SDRAM MT53B256M32D1.” Micron Technology, 2014.
[16] “Nintendo Switch Teardown - iFixit.” https://www.ifixit.com/Teardown/Nintendo+Switch+Teardown/78263 (accessed Jul. 12, 2022).
[17] Vadhiraj Sankaranarayanan, “Which DDR SDRAM Memory to Use and When,” 2019.
[18] “LPDDR SDRAM vs. DDR SDRAM: What exactly are the differences these days?,” Nov. 29, 2021. https://www.reddit.com/r/hardware/comments/r5bqo0/lpddr_sdram_vs_ddr_sdram_what_exactly_are_the/ (accessed Jul. 14, 2022).
[19] “Pixel C specifications - Pixel Help.” https://support.google.com/pixel/answer/6328677?hl=en (accessed Jul. 12, 2022).
[20] “Linux perf Examples.” https://www.brendangregg.com/perf.html (accessed Jun. 12, 2022).
[21] D. Zaparanuks, M. Jovic, and M. Hauswirth, “Accuracy of performance counter measurements,” ISPASS 2009 - International Symposium on Performance Analysis of Systems and Software, pp. 23–32, 2009, doi: 10.1109/ISPASS.2009.4919635.
[22] M. J. Walker et al., “Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106–119, 2017, doi: 10.1109/TCAD.2016.2562920.
[23] B. K. Reddy, M. J. Walker, D. Balsamo, S. Diestelhorst, B. M. Al-Hashimi, and G. V. Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator,” 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation, PATMOS 2017, vol. 2017-Janua, pp. 1–8, 2017, doi: 10.1109/PATMOS.2017.8106988.
[24] Arm, ARM® Cortex®-A57 MPCore Processor Technical Reference Manual. Arm Limited, 2014.
[25] F. E. Harrell, K. L. Lee, R. M. Califf, D. B. Pryor, and R. A. Rosati, “Regression modelling strategies for improved prognostic prediction,” Statistics in Medicine, vol. 3, no. 2, pp. 143–152, Apr. 1984, doi: 10.1002/SIM.4780030207.
[26] P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein, “A simulation study of the number of events per variable in logistic regression analysis,” Journal of Clinical Epidemiology, vol. 49, no. 12, pp. 1373–1379, 1996, doi: 10.1016/S0895-4356(96)00236-3.
[27] J. Lowe-Power et al., “The gem5 Simulator: Version 20.0+,” Jul. 2020, doi: 10.48550/arxiv.2007.03152.
[28] A. Tousi and C. Zhu, “Arm Research Starter Kit : System Modeling using gem5,” 2022.
[29]
[30]
[31]
A. Butko et al., “Full-System Simulation of big.LITTLE Multicore Architecture for
Performance and Energy Exploration,” Proceedings - IEEE 10th International Symposium
on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016, pp. 201–208, Dec.
2016, doi: 10.1109/MCSOC.2016.20.
ARM., “Cortex®-A57 Software Optimization Guide.” Arm Limited, p. 42, 2016.
[Online]. Available: https://developer.arm.com/documentation/uan0015/b/
A. Butko, A. Gamatié, G. Sassatelli, L. Torres, and M. Robert, “Design exploration for
next generation high-performance manycore on-chip systems: Application to big.LITTLE

39

[32]
[33]
[34]
[35]
[36]

architectures,” Proceedings of IEEE Computer Society Annual Symposium on VLSI,
ISVLSI, vol. 07-10-July, pp. 551–556, 2015, doi: 10.1109/ISVLSI.2015.28.
J. Mohd-yusof, “CoDesign Molecular Dynamics ( CoMD ) Proxy App Deep Dive,” Oct.
2012.
R. K. Karmani et al., “ATLAS (Automatically Tuned Linear Algebra Software),”
Encyclopedia of Parallel Computing, pp. 95–101, 2011, doi: 10.1007/978-0-387-097664_85.
M. A. Heroux and Jack. Dongarra, “Toward a new metric for ranking high performance
computing systems.,” Jun. 2013, doi: 10.2172/1089988.
S. Beamer, K. Asanović, and D. Patterson, “The GAP Benchmark Suite,” Aug. 2015, doi:
10.48550/arxiv.1508.03619.
“Himeno benchmark | ISC, RIKEN.”
https://i.riken.jp/en/supercom/documents/himenobmt/ (accessed Jul. 14, 2022).

40

Appendix A Benchmarks
This appendix lists the applications used in this work and the commands used to run them.
Several of these applications read input data from files. These files are available on the
project’s website.
Table A.1 Benchmark Sources
App       Cite  Source                                           Version/commit
CoMD      [32]  github.com/exmatex/CoMD                          3d48396
ATLAS     [33]  math-atlas.sourceforge.net/                      3.10.3
HPCG      [34]  github.com/hpcg-benchmark/hpcg                   e3fe052
PageRank  [35]  github.com/sbeamer/gapbs                         120cd01
SSSP      [35]  github.com/sbeamer/gapbs                         120cd01
BFS       [35]  github.com/sbeamer/gapbs                         120cd01
Himeno    [36]  github.com/kowsalyaChidambaram/Himeno-Benchmark  2dbea18

Table A.2 Benchmark configurations
App Name            Command                                      Comments
CoMD Small          CoMD-serial -e -x 20 -y 20 -z 20 -N 10       MPI version with DO_MPI=OFF
CoMD Large          CoMD-serial -e -x 10 -y 10 -z 10 -N 40       MPI version with DO_MPI=OFF
ATLAS Small         xdl3blastst -m 200 -n 200 -# 300 -C 1 -T 0
ATLAS Medium        xdl3blastst -m 600 -n 600 -# 10 -T 0
ATLAS Large         xdl3blastst -m 1000 -n 1000 -# 10 -T 0
HPCG                xhpcg 32 32 32
PageRank KR Small   pr -f kronecker_19.sg -n 5                   Prebuilt Kronecker graph
PageRank KR Large   pr -f kronecker_21.sg -n 1                   Prebuilt Kronecker graph
PageRank Uniform    pr -f uniform_19.sg -n 5                     Prebuilt uniform random graph
SSSP Uniform        sssp -f uniform_20.wsg -n 1                  Prebuilt uniform random weighted graph
SSSP Kronecker      sssp -f kronecker_19.wsg -n 4                Prebuilt uniform random weighted graph
Himeno Extra Small  himenobmtxpa xs
Himeno Small        himenobmtxpa s
BFS Uniform         bfs -f kronecker_21.sg -n 10                 Prebuilt Kronecker graph
BFS Large           bfs -f uniform_20.sg                         Prebuilt uniform random graph
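As a rough illustration of how the configurations in Table A.2 could be driven from a script, the sketch below stores a few of the commands as strings and splits them into argument lists of the kind accepted by Python's subprocess module. The names CONFIGS and build_argv and the bin_dir and data_dir parameters are assumptions introduced here for illustration; they are not part of the tooling used in this work, and the sketch assumes the binaries and graph files already exist.

```python
import os
import shlex

# A subset of the rows of Table A.2, keyed by configuration name.
CONFIGS = {
    "CoMD Small": "CoMD-serial -e -x 20 -y 20 -z 20 -N 10",
    "ATLAS Small": "xdl3blastst -m 200 -n 200 -# 300 -C 1 -T 0",
    "HPCG": "xhpcg 32 32 32",
    "PageRank KR Small": "pr -f kronecker_19.sg -n 5",
    "Himeno Small": "himenobmtxpa s",
}


def build_argv(name, bin_dir=".", data_dir="."):
    """Return the argument list for one configuration.

    bin_dir and data_dir are hypothetical locations of the benchmark
    binaries and of the prebuilt graph files.
    """
    argv = shlex.split(CONFIGS[name])
    argv[0] = os.path.join(bin_dir, argv[0])
    # The argument following -f names a graph file; look for it in data_dir.
    for i, arg in enumerate(argv[:-1]):
        if arg == "-f":
            argv[i + 1] = os.path.join(data_dir, argv[i + 1])
    return argv


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, "->", build_argv(name))
```

Each returned list can be passed to subprocess.run unchanged; keeping the commands as data makes it easy to repeat the same set at several clock settings.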

Appendix B Gem5 Modifications
This appendix lists the changes made to the source code of Gem5 to model the NVIDIA TX1
system and to allow independent specification of cache and core speeds. The models for the TX1
include a CPU model for the Arm A57 and a model for the LPDDR4 memory used. The parameters
for the CPU were placed in O3_TX1.py. The parameters for the memory were added to Gem5’s
existing list of memory models in DRAMCtrl.py. Extending Gem5 to support decoupled cache and
core speeds required changes to two files, CacheConfig.py and Options.py. CacheConfig.py
contains the changes to the instantiation procedure for L2 caches, while Options.py adds the
command line option for independent cache speed. Full source code including changes is
available from:
https://github.com/daviddpruitt/gem5-decoupled
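To illustrate what independent cache and core clocks mean for timing, the sketch below parses a core clock and an optional, separate L2 clock and converts a fixed L2 latency in cycles into nanoseconds. It does not use Gem5 and is not taken from the modified CacheConfig.py or Options.py; the option names --cpu-clock, --l2-clock, and --l2-cycles, the helper names, and the default values are all assumptions chosen for illustration.

```python
import argparse


def parse_hz(text):
    """Convert a frequency string such as '700MHz' or '1GHz' to hertz."""
    units = {"GHz": 1e9, "MHz": 1e6, "kHz": 1e3, "Hz": 1.0}
    for suffix, scale in units.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * scale
    raise ValueError(f"unrecognized frequency: {text!r}")


def latency_ns(cycles, clock_hz):
    """Time in nanoseconds taken by `cycles` ticks of a clock at clock_hz."""
    return cycles * 1e9 / clock_hz


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu-clock", default="1GHz")
    # Hypothetical option: if omitted, the L2 follows the core clock (coupled).
    parser.add_argument("--l2-clock", default=None)
    # Illustrative latency in cache cycles, not a measured A57 value.
    parser.add_argument("--l2-cycles", type=int, default=12)
    return parser


def l2_latency_ns(args):
    """L2 latency in ns, using the L2 clock when given, else the core clock."""
    l2_clock = args.l2_clock or args.cpu_clock
    return latency_ns(args.l2_cycles, parse_hz(l2_clock))


if __name__ == "__main__":
    for argv in ([], ["--cpu-clock", "700MHz"],
                 ["--cpu-clock", "700MHz", "--l2-clock", "2GHz"]):
        args = build_parser().parse_args(argv)
        print(argv, f"{l2_latency_ns(args):.2f} ns")
```

With a coupled cache, lowering the core clock stretches the L2 latency in wall-clock time; with a separate cache clock, the latency depends only on the cache frequency.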

Appendix C Power Measurement
The power data used for the linear regression model in this work was collected using an
Arduino-based power logging system, as shown in Figure A.1. A script ran each kernel
for at least 1 minute at every frequency the TX1 supports, starting at 700 MHz. The scripts
included synchronization and calibration data for use in offline analysis.
To exclude power supply inefficiencies from the measurements, current and voltage were
measured between the power supply and the TX1 development board. Current was measured by a
Hall effect current sensor. A low pass filter was used to ensure a clean signal. Voltage was
measured directly by the Arduino. Data logs from the Arduino were saved to a server for
offline processing.
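The offline processing step turns logged current and voltage samples into power and energy. A minimal sketch of that arithmetic follows, assuming each log record is a (time in seconds, volts, amps) tuple; this record layout and the function names are assumptions for illustration, not the format produced by the logging board. Instantaneous power is P = V * I, and energy is approximated by trapezoidal integration of power over time.

```python
def power_watts(samples):
    """Return (time, power) pairs, with P = V * I for each (t, volts, amps) sample."""
    return [(t, v * i) for t, v, i in samples]


def energy_joules(samples):
    """Approximate energy as the trapezoidal integral of power over time."""
    p = power_watts(samples)
    return sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(p, p[1:]))


def mean_power_watts(samples):
    """Time-weighted average power over the logged interval."""
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("need samples spanning a positive time interval")
    return energy_joules(samples) / duration


if __name__ == "__main__":
    # Made-up values, for illustration only.
    log = [(0.0, 10.0, 1.0), (1.0, 10.0, 1.0), (2.0, 10.0, 2.0)]
    print(energy_joules(log), mean_power_watts(log))
```

Dividing energy by duration, rather than averaging the power samples directly, keeps the mean correct when the sampling interval is not uniform.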

[Block diagram. Components: TX1 Power Supply, Hall Effect Current Sensor, Voltage Sensor,
TX1 Development Board, Low Pass Filter, Arduino Current and Voltage Logging Board,
Current and Voltage Logs, File Server]

Figure A.1 Power Logging Setup

Vita
David Pruitt graduated from the University of Texas at El Paso with a Bachelor of Science
in Computer Science in 2011. He received his Master of Science in Computer Science in 2016.
If you’re reading this, he received his Doctor of Philosophy in Computer Science in 2022.
When he entered graduate school in 2013, he joined the Robust Autonomic systems group.
There he worked on several projects involving the Android operating system. During this time he
was a teaching assistant for many of the classes within the Computer Science Department. Once
he started his doctorate, he began teaching a variety of classes, including Computer Architecture,
Operating Systems, and Parallel Programming. During his doctorate, he held internships at Oak
Ridge National Laboratory, researching the effectiveness of running applications on GPUs.

Contact Information: ddpruitt@miners.utep.edu


