Memory Interference Characterization and Mitigation for Heterogeneous Smartphones by SHINGARI, DAVESH (Author) et al.
Memory Interference Characterization and Mitigation for Heterogeneous
Smartphones
by
Davesh Shingari
A Thesis Presented in Partial Fulfillment
of the Requirements for the Degree
Master of Science
Approved August 2016 by the
Graduate Supervisory Committee:
Carole-Jean Wu, Chair
Sarma Vrudhula
Aviral Shrivastava
ARIZONA STATE UNIVERSITY
December 2016
ABSTRACT
The availability of a wide range of general purpose as well as accelerator cores on
modern smartphones means that a significant number of applications can be executed
on a smartphone simultaneously, resulting in an ever increasing demand on the mem-
ory subsystem. While the increased computation capability is intended for improving
user experience, memory requests from each concurrent application exhibit unique
memory access patterns as well as specific timing constraints. If not considered, this
could lead to significant memory contention and result in lowered user experience.
This work first analyzes the impact of memory degradation caused by the inter-
ference at the memory system for a broad range of commonly-used smartphone appli-
cations. The real system characterization results show that smartphone applications,
such as web browsing and media playback, suffer significant performance degradation.
This is caused by shared resource contention at the application processor’s last-level
cache, the communication fabric, and the main memory.
Based on the detailed characterization results, the rest of this thesis focuses on the
design of an effective memory interference mitigation technique. Since web browsing
is one of the most commonly used smartphone applications and represents many
HTML-based smartphone applications, my thesis focuses on meeting the performance
requirement of a web browser on a smartphone in the presence of background processes
and co-scheduled applications. My thesis proposes a light-weight user space frequency
governor to mitigate the degradation caused by interfering applications, by predicting
the performance and power consumption of web browsing. The governor selects an
optimal energy-efficient frequency setting periodically by using the statically-trained
performance and power models with dynamically-varying architecture and system
conditions, such as the memory access intensity of background processes and/or co-
scheduled applications, and temperature of cores. The governor has been extensively
evaluated on a Nexus 5 smartphone over a diverse range of mobile workloads. By
operating at the most energy-efficient frequency setting in the presence of interference,
energy efficiency is improved by as much as 35%, and by 18% on average, compared
to the existing interactive governor, while maintaining the satisfactory performance
of web page loading under 3 seconds.
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to my advisor, Dr. Carole-Jean Wu, for
her continuous support, motivation, and guidance. Dr. Wu has been a great source of
inspiration with her hard work and passion for computer architecture. She kindled my
interest in research and has helped me sustain that interest by offering challenging
problems to work on. I am deeply grateful for the weekly meetings and periodic
discussions that immensely helped me in questioning my thoughts and exploring new
domains. I am also thankful to her for encouraging the use of correct grammar and
for carefully reading and commenting on countless revisions of this manuscript. This
thesis would not have been possible without her support and mentoring.
Besides my advisor, I am grateful to my committee members Dr. Sarma Vrudhula
and Dr. Aviral Shrivastava for their encouragement, insightful comments and valuable
suggestions that have made my thesis better.
I want to thank my fellow labmates and friends Akhil, Benjamin, Chetan, Nishant,
Shin and Vignesh for helping me stay motivated through tougher times.
Most importantly, I am thankful to my family for their vital and unrelenting
support.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
  1.1 Research Overview
    1.1.1 Characterization of Memory Interference for Heterogeneous Smartphones
    1.1.2 Memory Interference Mitigation
  1.2 Thesis Contributions
    1.2.1 Thesis Outline
2 BACKGROUND
  2.1 SoC Architecture Overview
  2.2 Web Page Execution Flow
3 MEMORY INTERFERENCE CHARACTERIZATION
  3.1 Experimental Methodology
    3.1.1 Device Infrastructure
    3.1.2 Benchmarks and Input Sets
    3.1.3 Metrics
  3.2 Scenario 1: Application Processor Memory Interference
    3.2.1 Sustained Computation-bound Smartphone Workloads
    3.2.2 Interactive, Real-time Smartphone Workloads
  3.3 Scenario 2: Memory Interference in the Shared Communication Fabric and Main Memory
    3.3.1 Sustained Computation-bound Smartphone Workloads
    3.3.2 Interactive, Real-time Smartphone Workloads
  3.4 Chapter Summary
4 MEMORY INTERFERENCE MITIGATION
  4.1 Experimental Methodology
    4.1.1 Real Device Measurement Infrastructure
    4.1.2 Web Page Characteristics
    4.1.3 Interfering Application Characteristics
    4.1.4 Multiprogrammed Workloads
    4.1.5 Model Parameters and Configuration
  4.2 Design
  4.3 Power and Performance Models
    4.3.1 Model-based Web Page Load Time Prediction
    4.3.2 Dynamic and Leakage Power Prediction
  4.4 Performance and Power Model Accuracy
    4.4.1 Performance Model Accuracy
    4.4.2 Power Model Accuracy
  4.5 Evaluation Results and Analysis
    4.5.1 Performance and Energy Efficiency Trends
    4.5.2 The Adaptive Nature of DORA
    4.5.3 Interaction of DORA with Memory Interference Intensity
    4.5.4 Impact of Leakage Power
    4.5.5 Interaction of DORA with Varying QoS
    4.5.6 Overhead
  4.6 Comparison with Prior Work
  4.7 Chapter Summary
5 CONCLUSION
  5.1 Future Research Directions
REFERENCES
APPENDIX
A PERFORMANCE AND POWER MODELS
B WORKLOADS
C DEVICE LAYOUT
LIST OF TABLES
3.1 Device Specification
3.2 Descriptions for the Applications and the Corresponding Application Domains
3.3 Workload Combinations for Scenario 1a
3.4 Workload Combinations for Scenario 2a
3.5 Workload Combinations for Scenario 2b
4.1 Web Page Classification
4.2 Interfering Application Classification
4.3 List of Independent Variables
4.4 Performance and Power Model Accuracy
A.1 Frequency Domains
A.2 List of Independent Variables
B.1 Workloads used for DORA Evaluation
B.2 Workloads used for DORA Evaluation
LIST OF FIGURES
1.1 Performance Degradation Experienced by Browser
1.2 Impact of Memory Interference on Webpage Load Time
1.3 Energy-Efficient Frequency Setting (fopt) for Reddit
2.1 Typical Heterogeneous Mobile SoC
3.1 Scenario 1a: Application Processor Memory Interference
3.2 Scenario 1b: Performance Degradation S-curve for Web Browsing
3.3 Scenario 1b: Performance Degradation S-curve for Media Player
3.4 Scenario 2a: Memory Interference in the Shared Communication Fabric and Main Memory
3.5 Scenario 2b: Performance Degradation of Interactive, Real-Time Smartphone Workloads
4.1 DORA Overview
4.2 Cumulative Distribution of Prediction Errors for Performance Models
4.3 Cumulative Distribution of Prediction Errors for Power Models
4.4 Average Energy Efficiency and Web Page Load Time Comparison of DORA
4.5 Energy Efficiency Comparison of DORA with other Governors - 1
4.6 Energy Efficiency Comparison of DORA with other Governors - 2
4.7 Interaction of DORA with Varying Co-scheduled Application Memory Intensity
4.8 Impact of Leakage Power
4.9 DORA's Frequency Selection for Different QoS Targets
5.1 IR Camera Images
C.1 Nexus 5 PCB Layout
C.2 Nexus 5 PCB Internals - 1
C.3 Nexus 5 PCB Internals - 2
Chapter 1
INTRODUCTION
There has been explosive growth in the use of mobile computing devices, especially
smartphones, for our everyday computing needs. According to the International
Telecommunication Union, there were nearly 7 billion mobile subscriptions as of May
2014 [11]. To meet the high performance expectations of users, more and more
application processor cores as well as graphics processing unit (GPU) cores are being
integrated into the system-on-chip (SoC) in each generation of smartphone chips.
The availability of a wide range of general-purpose as well as accelerator cores
means that a significant number of applications can be now executed on a smart-
phone simultaneously, resulting in an ever-increasing demand on the main memory.
For example, a typical smartphone multi-program use case consists of file download-
ing via the Wi-Fi/LTE antenna, music playback via a specific accelerator, and web
browsing on the application processor cores, generating a heterogeneous combination
of memory requests at the main memory. Furthermore, as GPUs are becoming more
and more programmable, general-purpose computations are increasingly offloaded to
mobile GPUs, e.g., [23], to achieve higher performance and improved energy efficiency.
Similarly, more and more accelerators have been added to modern smartphone SoCs
to execute special functions as energy-efficiently as possible, e.g., Qualcomm’s
programmable digital signal processor (DSP) released in 2013, with many more
low-power accelerators expected in the years to come [26, 43]. The Android 4.2.2 and iOS 9
operating systems support multiprogrammed features such as screen sharing between
multiple applications. In multi-threaded applications, computations are distributed
between processor cores and programmable accelerators. Both of these give rise to
increased contention for caches and main memory. Such sharing of resources, if not
properly controlled, can result in significant performance degradation, e.g., missed
deadlines, which is particularly problematic for real-time, interactive applications, as
this directly manifests as lower user satisfaction.
[Figure 1.1 plot omitted; y-axis: Web Browsing Performance Slowdown with Co-Scheduled Mobile Apps, 1 to 1.4; the 3.0-second load time is marked.]
Figure 1.1: A significant 10-33% performance slowdown is experienced by user-centric,
interactive web browsing that is co-scheduled with other mobile workloads.
To quantify the degree of performance degradation caused by the interference in
the shared cache as well as in the main memory, I construct a number of scenarios
that represent different user behavior on smartphones. The characterization results
in Figure 1.1 show that a significant 10-33% performance slowdown on user-centric,
interactive web browsing is experienced when it is co-scheduled with other mobile
workloads, ranging from concurrent file download, transfer, and compression to
scientific algorithms that are fundamental building blocks of future mobile workloads.
Some degree of the performance degradation can be tolerated without sacrificing
user satisfaction while other performance degradation directly violates user-centric
deadlines—user satisfaction is determined by the absolute execution time. According
to recent studies based on large mobile web user experiences, 40% of users aban-
don web pages that do not load within 3 seconds [44]. In other words, the 3-second
mark is the performance quality-of-service (QoS) target for interactive web brows-
ing. As marked in Figure 1.1, the 3-second mark corresponds to 9% performance
degradation of web browsing, which can be tolerated without sacrificing user experi-
ence. However, in all seven realistic workloads, web browsing experiences significant
performance degradation that exceeds the latency tolerance of the workload. This
performance degradation is expected to worsen as more and more GPU cores and
accelerators are being integrated into the smartphone chip, leading to even more sig-
nificant contention at the main memory for concurrent workloads and resulting in
additional performance loss and unpredictability.
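The relationship between the 3-second deadline and the tolerable slowdown can be illustrated with a little arithmetic. The sketch below assumes that the 9% figure is measured relative to the stand-alone load time, which implies a stand-alone load time of roughly 3.0 / 1.09 seconds; the constant and function names are hypothetical.

```python
QOS_DEADLINE_S = 3.0        # 3-second page load target from the text
TOLERABLE_SLOWDOWN = 1.09   # 9% degradation corresponds to the 3-second mark

# Assumption: slowdown is relative to the stand-alone (no interference) load time.
BASELINE_LOAD_TIME_S = QOS_DEADLINE_S / TOLERABLE_SLOWDOWN


def load_time_under_slowdown(slowdown: float) -> float:
    """Load time in seconds when the page runs `slowdown` times slower than alone."""
    return BASELINE_LOAD_TIME_S * slowdown


def meets_qos(slowdown: float) -> bool:
    """True if the slowed-down page still loads within the deadline."""
    # Small tolerance so that exactly 1.09x is treated as meeting the deadline.
    return load_time_under_slowdown(slowdown) <= QOS_DEADLINE_S + 1e-9


if __name__ == "__main__":
    for slowdown in (1.05, 1.10, 1.33):
        t = load_time_under_slowdown(slowdown)
        print(f"slowdown {slowdown:.2f}x -> {t:.2f} s, meets QoS: {meets_qos(slowdown)}")
```

Under this assumption, every slowdown in the 10-33% range reported in Figure 1.1 pushes the load time past 3 seconds.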
Web browsing, being one of the most commonly used smartphone applications,
serves as a standard for many HTML-based smartphone applications. Its performance
has a direct impact on user satisfaction and, consequently, on the revenue of websites.
The complexity of web pages continues to increase, with sizes of web pages
projected to exceed 2MB by 2018 [3], leading to ever-increasing demand for computation
and memory resources. Thus, accelerating web browsing has gained significant attention
as a target for performance and energy optimization [42, 45].
Existing work on co-optimization of performance and energy efficiency has ig-
nored the effect of background tasks or co-scheduled applications on smartphones
that represent realistic user scenarios, which can result in highly sub-optimal solutions.
Figure 1.2 shows measurements (footnote 1) of the webpage load time for Reddit at
different frequencies, when it is co-scheduled with applications with different memory
intensities. The vertical bars show the variations in load time over repeated loads,
at each frequency, and the dotted line shows the average values. The results indicate
significant variation in the load times, with the magnitude of the variation depending
on the frequency. In general, the lower the memory interference, the lower the
load time for any given frequency. This variation in load times, which is due to
the state of the other processes, can result in violations of QoS requirements.
For example, although the 0.9 GHz frequency setting allows the webpage to meet
the 3-second load time deadline when the extent of memory interference is low, the
webpage would miss its deadline with greater interference. There are even greater
variations in the measured energy efficiency (performance-per-watt or PPW) at each
frequency due to varying memory interference. In fact, the frequency at which the
PPW is maximum typically varies with the extent of interference that is present in
the system. Figure 1.3 shows the PPW plot of Reddit at different frequencies, when
it is co-scheduled with an interfering application. The most energy-efficient frequency
setting (fopt) is the one that allows the webpage to load with the least amount of energy
while meeting the QoS target. We observe that 1.2 GHz is the most energy-efficient
frequency setting that allows Reddit to meet the 3-second QoS target. The result
demonstrates the importance of considering both performance and power of web
browsing in order to choose the optimal frequency setting that maximizes device energy
efficiency while ensuring webpages meet their QoS deadlines, when co-scheduled
with interfering applications.
Footnote 1: The data was collected with a Google Nexus 5 smartphone by rendering Reddit alone and
concurrently with the other interfering applications in a multiprogrammed manner on the Qualcomm
Krait application processor.
[Figure 1.2 plot omitted; x-axis: Core Frequency (GHz), 0.7 to 2.2; y-axis: Web Page Load Time (seconds), 0 to 6; the deadline is marked.]
Figure 1.2: Impact of memory interference on webpage load time, at different frequencies
for the webpage Reddit. The High-Low bars indicate the range of webpage
load times experienced at each frequency under the presence of different degrees of
interference.
[Figure 1.3 plot omitted; top panel y-axis: Load Time (Seconds), 0 to 6, with the deadline and fopt marked; bottom panel y-axis: Energy Efficiency (PPW), 0.2 to 0.3; x-axis: CPU Core Frequency (GHz), 0.5 to 2.5.]
Figure 1.3: The most energy-efficient frequency setting (fopt) for the webpage Reddit.
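The selection of fopt described in this chapter can be written compactly as a constrained optimization. This is a sketch under one assumption that the text does not state explicitly: performance is taken to be the reciprocal of the web page load time, so that PPW is the reciprocal of the energy consumed per page load. The symbols T(f) (load time at frequency f), P(f) (average device power at f), E(f) (energy per page load), T_QoS (the deadline), and F (the set of available frequency settings) are introduced here for illustration only.

```latex
\begin{align}
\mathrm{PPW}(f) &= \frac{1/T(f)}{P(f)} = \frac{1}{T(f)\,P(f)} = \frac{1}{E(f)}, \\
f_{\mathrm{opt}} &= \operatorname*{arg\,max}_{f \in \mathcal{F}} \; \mathrm{PPW}(f)
\quad \text{subject to} \quad T(f) \le T_{\mathrm{QoS}}.
\end{align}
```

Under this reading, maximizing PPW among the frequencies that meet the deadline is equivalent to minimizing the energy per page load, which matches the description of fopt as the setting that loads the page with the least energy while meeting the QoS target. Both T(f) and P(f) depend on the current level of memory interference, which is why fopt shifts as interference changes.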
1.1 Research Overview
The primary focus of my thesis is to develop an effective solution that provides a
performance QoS guarantee for mobile web browsing in the presence of other co-scheduled
applications. My thesis first characterizes the performance degradation
caused by the interference at the memory subsystem for a broad range of commonly
used smartphone applications. To address the performance degradation caused by
the memory interference, this thesis proposes a light-weight user space frequency
governor which allows the web browser to meet its specified deadline while achieving
maximum energy efficiency.
1.1.1 Characterization of Memory Interference for Heterogeneous Smartphones
My thesis first characterizes the memory interference at the LLC and main mem-
ory for a wide range of emerging smartphone workloads [38]. I evaluate a set of
user-centric, interactive workloads, such as web browsing and video playback, as well
as machine-learning, image-based computations, compression algorithms, and many
other applications representative of future smartphone workloads. I also design ex-
periments to focus on memory interference-caused performance degradation in the
shared cache memory of the application processor as well as in the main memory
shared between the application processor and other accelerators. The real-device
performance characterization results on a Google Nexus 5 smartphone show that the
performance of user-centric web browsing and the media player is significantly degraded
by the memory interference at the application processor’s last-level cache and at the
shared communication fabric and the main memory. The performance characterization
results presented in the thesis motivate and offer insights into novel memory scheduling
designs as well as QoS-aware resource management techniques.
1.1.2 Memory Interference Mitigation
Since the web browser is one of the most widely used applications on smartphones, I
design a memory interference mitigation technique for the web browser. I implement
a light-weight user space frequency governor, the Dynamic thrOttling-based memoRy
interference Aware technique (DORA), within the Android KitKat operating
system on a Google Nexus 5 smartphone. DORA is a model-based frequency governor
that selects the optimal frequency when the web browser is co-scheduled with
an interfering application. DORA computes the estimated web page load time and
device power at different frequency settings and periodically selects an optimal
energy-efficient frequency setting by using the statically-trained performance and power
models with dynamically-varying architecture and system conditions, such as the
memory access intensity of background processes and/or co-scheduled applications
and the temperature of the cores. This allows web browsing to meet its specified deadline
while simultaneously achieving maximum energy efficiency for the entire smartphone.
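The decision step described above can be sketched as follows. This is a minimal illustration and not DORA's implementation: the two model functions, their coefficients, the frequency list, and the input names are hypothetical placeholders standing in for the statically-trained models, and energy efficiency is taken as the reciprocal of the predicted energy per page load.

```python
from typing import Callable, Sequence

# Hypothetical frequency steps in GHz; a real device exposes its own list.
FREQS_GHZ = (0.7, 0.9, 1.2, 1.5, 1.9, 2.2)


def predict_load_time(freq_ghz: float, mem_intensity: float) -> float:
    """Placeholder performance model: load time (s) falls with frequency and
    rises with the memory access intensity of co-runners (0.0 to 1.0)."""
    return (2.4 / freq_ghz + 0.8) * (1.0 + 0.5 * mem_intensity)


def predict_power(freq_ghz: float, temp_c: float) -> float:
    """Placeholder power model (W): a dynamic term that grows with frequency
    plus a leakage term that grows with core temperature."""
    dynamic = 0.4 * freq_ghz ** 2
    leakage = 0.3 + 0.005 * (temp_c - 25.0)
    return dynamic + leakage


def select_frequency(
    freqs: Sequence[float],
    mem_intensity: float,
    temp_c: float,
    deadline_s: float = 3.0,
    load_time_model: Callable[[float, float], float] = predict_load_time,
    power_model: Callable[[float, float], float] = predict_power,
) -> float:
    """Return the frequency with the lowest predicted energy per page load
    among those predicted to meet the deadline."""
    best_freq = None
    best_energy = float("inf")
    for f in freqs:
        t = load_time_model(f, mem_intensity)
        if t > deadline_s:
            continue  # predicted to miss the QoS target
        energy = t * power_model(f, temp_c)
        if energy < best_energy:
            best_freq, best_energy = f, energy
    # Assumed fallback: if no setting meets the deadline, run as fast as possible.
    return best_freq if best_freq is not None else max(freqs)


if __name__ == "__main__":
    # A governor would repeat this decision periodically with fresh inputs.
    for intensity in (0.0, 0.4, 1.0):
        chosen = select_frequency(FREQS_GHZ, intensity, temp_c=40.0)
        print(f"memory intensity {intensity:.1f} -> {chosen} GHz")
```

With these placeholder models, the chosen frequency rises as the co-runner's memory intensity increases, because lower settings are no longer predicted to meet the deadline.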
1.2 Thesis Contributions
My thesis characterizes the impact of memory degradation caused by the interfer-
ence at the memory system and proposes an effective memory interference mitigation
technique. Overall, this thesis makes the following contributions by:
1. Characterizing the degree of memory access interference patterns in modern
smartphone workloads, which have distinct user experience requirements.
2. Implementing an effective solution to provide a performance QoS guarantee for
mobile web browsing in the presence of other co-scheduled applications.
3. Evaluating the light-weight user space frequency governor within the Android
Kit Kat operating system on a Google Nexus 5 smartphone. The design can be
easily ported to other smartphones and the model can be extended with online
learning.
1.2.1 Thesis Outline
The remainder of this thesis is organized as follows: Chapter 2 reviews modern
smartphone architecture and the internals of the web browser. Chapter 3 presents the
real-device memory interference characterization results. This chapter looks at the impact
of interference at the LLC and main memory. Then Chapter 4 presents the design of
a light-weight model-based frequency governor, the Dynamic thrOttling-based memoRy
interference Aware technique (DORA). This chapter also evaluates the underlying performance
and power models, and then evaluates the performance of the implemented frequency
governor. Chapter 5 summarizes the results and presents future research directions.
Chapter 2
BACKGROUND
To better understand the context of my work, I review background materials
on modern smartphone architectures and various sources of contention present in
heterogeneous systems.
2.1 SoC Architecture Overview
Modern smartphone architecture consists of a plethora of accelerators sharing the
main memory and communication fabric. Figure 2.1 illustrates the architecture of a
typical modern smartphone SoC. Due to the large number of accelerators that perform
computations with the general-purpose application processor, a mixture of memory
requests arrives at the main memory concurrently, leading to high contention in the
communication fabric as well as in the shared main memory.
For example, the Low Power Audio Subsystem (LPASS) is the hardware accelerator
for audio decoding in Qualcomm’s SoC. As the decoding computations migrate
from the application processor to LPASS, the application core can switch to a low
power state, reducing the total power consumption. However, since LPASS has a
limited amount of memory, the application processor has to wake up periodically and
manage the data and computation transfer to the accelerator. Similarly, for other
accelerators such as the GPU or digital signal processors (DSPs), data is transferred
frequently and periodically between the application processor and the accelerators
through the shared communication fabric and main memory. This leads to a high,
bursty bandwidth requirement for the main memory.
[Figure 2.1 diagram omitted; labeled blocks: Main Memory; Shared L2 Cache; Core0 to Core3, each with an L1 cache (L1$); GPU (Adreno, Mali); Camera / Image Signal Processor; Audio Accelerator; Digital Signal Processor; Connectivity (WiFi/Bluetooth).]
Figure 2.1: A typical heterogeneous mobile SoC running multiprogrammed workloads.
The LLC and LPDDR2/3 main memory are major shared resources serving requests
from heterogeneous accelerators and the application processor.
Another commonly used accelerator is the modem that receives data and places
the data directly into the main memory when users download files or receive emails.
Thus, the modem introduces additional contention to the communication fabric and
the memory and can potentially cause performance degradation to other tasks.
In addition to the contention at the communication fabric shared between the
application processor and the accelerators, the LLC is a shared resource for concurrent
applications that run on the application processor. The contention at the shared LLC
is also a major factor of performance degradation in modern smartphones. Therefore,
it is important to understand the shared last-level cache effect on performance in
smartphones.
Chapter 3 presents the experiments to characterize the performance degradation
caused by memory interference in two different scenarios—memory interference in the
application processor, and in the shared communication fabric and main memory.
2.2 Web Page Execution Flow
The web browser fetches content from the internet and its rendering engine is
responsible for rendering (displaying) the content that is fetched. Since fetching is
dependent on network latencies, and thus out of the architect’s control, we focus on
studying the performance of the rendering engine in this work. The rendering engine
parses the web page’s HTML document. An HTML page provides the blueprint of the
web page. Two important components of a web page are the tags and attributes
specified in the HTML page. The tags are used by the rendering engine to determine
the outline of the various blocks of a web page. The attributes are associated with tags
and describe the characteristics of each of these blocks. The tags and attributes
are parsed by the rendering engine to create a hierarchical structure called the DOM
tree, which defines the rendering order of the different blocks of each web page. The
DOM tree, along with the CSS attributes (which determine the visual properties and
style information of different blocks), completes the render tree. This render tree goes
through a layout and finally a paint stage to complete the web page load process.
Prior studies [21, 42] have shown that the web page load time is a function of the
complexity of the web page and is dominated by important web page features, such
as the number of tags and attributes and the amount of metadata utilized by the web
page. Since these properties of web pages are available before the page is rendered,
the web page load time can be pre-computed fairly accurately. However, existing
approaches cannot accurately estimate the web page load time in the presence of
other concurrent tasks because they do not take into account memory interference.
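Because tag and attribute counts are available from the HTML before rendering, they can be extracted with a simple parse. The sketch below uses Python's standard `html.parser` module to count start tags and attributes for a small page; the class name, the sample markup, and the choice of these two features are illustrative assumptions, not the feature set used in the prior studies.

```python
from html.parser import HTMLParser


class PageFeatureCounter(HTMLParser):
    """Counts start tags and attributes while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.num_tags = 0
        self.num_attributes = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        self.num_tags += 1
        self.num_attributes += len(attrs)


def count_features(html_text):
    """Return (number of start tags, number of attributes) for a page."""
    counter = PageFeatureCounter()
    counter.feed(html_text)
    counter.close()
    return counter.num_tags, counter.num_attributes


# Hypothetical sample page used only to exercise the counter.
SAMPLE_PAGE = """
<html>
  <head><title>Example</title></head>
  <body>
    <div id="main" class="content">
      <a href="https://example.com">link</a>
      <img src="logo.png" alt="logo" width="32">
    </div>
  </body>
</html>
"""

if __name__ == "__main__":
    tags, attributes = count_features(SAMPLE_PAGE)
    print(f"tags: {tags}, attributes: {attributes}")
```

Counts like these could serve as inputs to a load time model, alongside whatever interference-related inputs the model also needs.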
Chapter 3
MEMORY INTERFERENCE CHARACTERIZATION
In this chapter, I present the experiments designed to characterize the performance
degradation caused by memory interference in two different scenarios—memory in-
terference in the application processor, and in the shared communication fabric and
main memory. The characterization results presented here are based on the Nexus 5
smartphone.
3.1 Experimental Methodology
This section introduces the experimental methodology for real device measure-
ment to quantify the degree of memory interference in current and future smartphone
workloads.
3.1.1 Device Infrastructure
I perform real device experiments on a Google Nexus 5 smartphone which houses
a Qualcomm MSM8974 Snapdragon 800 SoC. The SoC has four Krait cores in the
application processor with a 2GB Low Power DDR (LPDDR) memory. There are
separate L1 instruction and data caches for each core and a shared unified L2 cache of
2MB. The device runs a rooted Android 4.4 KitKat OS. There are two programmable
accelerators—the Adreno GPU and the Hexagon Digital Signal Processor (DSP).
The Adreno GPU supports OpenGL ES 3.0 and OpenCL whereas the Hexagon DSP
encompasses aDSP (application) and mDSP (modem). The specification is outlined
in Table 3.1.
Table 3.1: Device Specification
Google Nexus 5
Operating System        Android Kit Kat
Chipset                 MSM8974 Snapdragon 800
Application Processor   Quad-core Krait
ISA                     ARMv7
L1 I/D Caches           private, 16KB per core
L2 Unified Cache        shared, 2MB
GPU                     Adreno 330
Advanced Graphics API   OpenGL ES 3.0/OpenCL
DSP                     Hexagon DSP
Memory                  LPDDR3, 2GB
Year of Release         2014
The Linux kernel is configured to enable performance profiling with Qualcomm’s
Snapdragon Performance Visualizer and Trepn Profiler [4, 5]. To eliminate the pos-
sibility of thermal emergencies that cause unpredictable frequency throttling for the
application processor, I manually set the frequency of the application processor to
operate at 1.574GHz.¹
3.1.2 Benchmarks and Input Sets
I use frequently-executed interactive smartphone workloads to characterize the
memory interference-caused performance degradation. I perform in-depth evaluations
for the interactive smartphone workloads—web browsing [28, 36] and an open source
media player (VLC) [6]. The web browser executes mainly on the application cores and
occasionally offloads computation to the Adreno GPU; its average GPU utilization
is 4.97%. The VLC media player can execute either on the application
processor or on the DSP. When focusing on the performance impact of
shared LLC interference, I execute VLC on the application processor (in
the disabled-acceleration mode), whereas when focusing on the performance impact
of the shared communication fabric and the memory, I execute VLC on
the DSP (in the full-acceleration mode). This represents realistic execution
behavior for mobile workloads. I also verify that the full-acceleration VLC media
player requires minimal utilization of the application processor and the Adreno
GPU.
¹The highest application processor frequency of the Google Nexus 5 smartphone is 2.2GHz. When
operating at this frequency, the application processor encounters frequent thermal emergencies,
leading to unpredictable changes in memory interference behavior. While the experiments conducted
in this work are based on a lower frequency of 1.5GHz, the characterization results are representative
of smartphone workloads. I expect to see even higher memory interference at a higher operating
frequency, leading to more significant performance degradation for interactive smartphone workloads.
In addition, I use algorithms from various benchmark suites, including Rodinia [22],
SPEC [16], and PARSEC [18], to represent future smartphone workloads. The Rodinia
applications are written in OpenMP and OpenCL,² and the remaining
applications are written in C/C++. All benchmarks are cross-compiled on the
host machine with the ARM-Android NDK toolchains [10] and are statically assigned
to a specific core. The binaries are pushed to the device and are launched from
the host machine via the adb terminal. Table 3.2 describes the application domains
represented by the algorithms used in this study.
Table 3.2: Descriptions for the Applications and the Corresponding Application Domains.
App Domain | Representative Algorithm | Description
User-centric
Web Browsing | bbench/GWB [28, 36] | Web browsing sequentially and iteratively loads and renders a set of popular websites (Amazon, BBC, CNN, Craigslist, eBay, ESPN, Google, MSN, Slashdot, Twitter, and YouTube) from the secondary storage.
Media Player | VLC [6] | VLC is an open source media player that is used to render a full-screen HD (720p) H264 MPEG-4 video of size 29MB.
SPEC [16]
File Compression/Decompression | bzip2 | A compression/decompression algorithm based on Julian Seward's bzip2 version 1.0.3.
Vehicle Space Optimization | mcf | An algorithm used for single-depot vehicle scheduling in public mass transportation.
Pattern Search | hmmer | An algorithm that performs sensitive database searching using statistical descriptions of protein sequences; it is typically used in computational biology to search for patterns in DNA sequences.
PARSEC [18]
Image Recognition and Augmented Reality | ferret | An image recognition application that is commonly used for content-based similarity search of feature-rich data, such as images or videos.
Rodinia [22]
Sensor Data Analysis | back propagation (bp) | A machine learning algorithm that trains the weights of connecting nodes on a layered neural network.
Image Processing | heartwall; srad | A movement tracking algorithm of a heart over a sequence of 104 ultrasound images; a diffusion method for ultrasonic and radar imaging applications based on partial differential equations.
Thermal Prediction and Management | hotspot | An iterative processor thermal modeling algorithm that solves a collection of differential equations.
Video Games | lavamd | An algorithm that calculates particle potential and relocation due to mutual forces between particles within a large 3D space.
Medical App | needleman-wunsch (nw) | A non-linear global optimization method for DNA sequence alignments.
Communication Protocols | LU-decomposition (lu) | A numerical method that factors a matrix as the product of a lower triangular matrix and an upper triangular matrix for solving a system of linear equations.
Map and Navigation | nearest neighbors (nn) | An algorithm that finds the k-nearest neighbors from an unstructured data set.
Graph Search/Traversal | breadth first search (bfs) | An algorithm that traverses and searches tree data structures.
²I run the OpenMP version of the Rodinia applications on the application processor, whereas
the OpenCL version of the Rodinia applications is run on the programmable GPU to represent
GPGPU workload scenarios.
Table 3.3: Workload Combinations for Scenario 1a
Use Case | Workload Combination
WL1: Augmented Reality | Compression, sensor data analysis, image recognition, thermal management (bzip2, bp, ferret, hotspot)
WL2: Video Game | Video game, communication protocol, image processing, compression (lavamd, lu, srad, bzip2)
WL3: Sensor-based Medical App | Sensor data analysis, graph search/traversal, medical app, communication protocol (bp, bfs, nw, lu)
WL4: Vehicle Navigation | Map navigation, graph search, communication protocol, vehicle space optimization (nn, bfs, lu, mcf)
WL5: Medical DNA Sequencing App | Pattern search, sensor data analysis, medical app, image processing (hmmer, bp, nw, heartwall)
3.1.3 Metrics
Performance degradation is used as the metric to quantify memory interference and
is defined as follows:
1. For web browsing, performance degradation is the web page loading time when
the browser runs with other co-scheduled applications, normalized to the loading
time when it runs alone.
2. For the media player, performance degradation is the frames rendered per second
(fps) when the media player runs with other co-scheduled applications, normalized
to the fps when it runs alone.
3. For all other workloads, performance degradation is the execution time when
an individual application runs with other co-scheduled applications, normalized
to the execution time when it runs alone.
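As a minimal sketch of the three definitions above, the normalization can be written as a single ratio. The function names and the sample measurements are illustrative assumptions, not values taken from the characterization. Note that for the time-based metrics a ratio above 1 indicates degradation, whereas for a rate metric such as fps a ratio below 1 does.

```python
def normalized_performance(co_scheduled: float, alone: float) -> float:
    """Measurement with co-scheduled applications, normalized to the standalone run."""
    if alone <= 0:
        raise ValueError("standalone measurement must be positive")
    return co_scheduled / alone


def degradation_percent(co_scheduled: float, alone: float,
                        higher_is_better: bool = False) -> float:
    """Express the normalized value as a percentage loss.

    Time metrics (load time, execution time): a ratio above 1 is a slowdown.
    Rate metrics (fps, higher_is_better=True): a ratio below 1 is a loss.
    """
    ratio = normalized_performance(co_scheduled, alone)
    return (1.0 - ratio) * 100.0 if higher_is_better else (ratio - 1.0) * 100.0


# Hypothetical measurements, for illustration only.
load_time_loss = degradation_percent(co_scheduled=2.5, alone=2.0)   # seconds
fps_loss = degradation_percent(24.0, 30.0, higher_is_better=True)   # frames/s
print(f"web browsing: {load_time_loss:.1f}%  media player: {fps_loss:.1f}%")
```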
[Figure 3.1 plot: bar chart; y-axis: % Performance Degradation (Compared to Each Application Running Alone), 0 to 18; x-axis: the applications of each workload, grouped as Augmented Reality, Video Game, Sensor-based Medical App, Vehicle Navigation, and Medical DNA Sequencing App, with a GMEAN bar per group.]
Figure 3.1: Scenario 1a: Application processor memory interference. Performance
degradation of each application compared to its performance when running alone on
the application processor.
3.2 Scenario 1: Application Processor Memory Interference
To quantify the degree of performance degradation caused by the memory inter-
ference within an application processor, I design experiments to evaluate the per-
formance effects in A) sustained computation-bound smartphone workloads, and B)
interactive, real-time smartphone workloads.
3.2.1 Sustained Computation-bound Smartphone Workloads
Compute-intensive workloads form the building blocks of future mobile workloads
such as augmented reality and navigation, which require image processing, image
recognition, machine learning, optimization algorithms, and many others.
I construct five representative workloads for different computation scenarios—
Augmented Reality (WL1), Video Game (WL2), Sensor-based Medical App (WL3),
Vehicle Navigation (WL4), and Medical DNA Sequencing App (WL5), described in
Table 3.3. I evaluate the performance degradation in each scenario in the presence of
application processor last-level cache and main memory interference. Figure 3.1 shows
that, when compared to the performance of each application running alone on the
smartphone application processor, applications experience 0.5% to 17.1% performance
slowdown. This performance degradation comes from the memory interference in the
shared L2 cache of the application processor and the main memory used by all co-
scheduled applications.
In particular, the performance degradation is most pronounced in the Augmented
Reality workload (WL1). The performance of the compression algorithm is signifi-
cantly affected by the memory intensive sensor data analysis algorithm when both
are executed on the general purpose application processor. This suggests that, in
order to ensure guaranteed performance of the compression algorithm used for im-
ages collected with sensors, explicit memory management is required. If the two
algorithms were run on accelerators, explicit data communication techniques should
be devised to ensure the two algorithms coordinate their computations to minimize
memory interference.
The overall system throughput degradation for the five representative workloads
studied in this work varies from 1% to 7% (geometric means), much lower than the
performance degradation observed in shared chip-multiprocessor caches [29, 37, 40].
I attribute this to the relatively large L2 cache in the Nexus 5 device (2MB), the lower
application processor operating frequency (1.5GHz), and the unique memory intensity
characteristics of the smartphone workloads. Although the performance
degradation in the sustained computation-bound smartphone workloads is less
pronounced, it is still important to devise effective designs that bound the performance
degradation and improve the system throughput. This is especially useful if some
applications in the workload combinations are latency-critical, e.g., the medical
app-related algorithms.
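The per-workload GMEAN values summarize several per-application results with a geometric mean. The following is a minimal sketch, assuming the mean is taken over normalized execution times (1.0 means no slowdown); the text does not spell out the exact aggregation procedure, and the sample values are hypothetical.

```python
import math
from typing import Iterable


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values, computed in the log domain."""
    data = list(values)
    if not data or any(v <= 0 for v in data):
        raise ValueError("geometric mean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in data) / len(data))


# Hypothetical normalized execution times of four co-scheduled applications.
normalized_times = [1.17, 1.02, 1.05, 1.01]
slowdown = (geometric_mean(normalized_times) - 1.0) * 100.0
print(f"workload slowdown (geometric mean): {slowdown:.1f}%")
```

The geometric mean is the usual choice for averaging normalized ratios because it does not depend on which run is used as the reference.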
3.2.2 Interactive, Real-time Smartphone Workloads
In addition to compute-intensive workloads, I also evaluate the performance degra-
dation experienced by interactive, real-time smartphone workloads, which are repre-
sentative of today’s smartphone user-oriented scenarios. I run web browsing and the
VLC media player individually with two other randomly chosen applications from
Table 3.2 on the four-core Krait processor. To minimize compute resource contention
and to focus on memory interference-induced performance degradation, I reserve two
cores for web browsing. This setup is based on the observation from a recent study [26]
that the utilization of web browsing hovers at around two cores. Similarly, I
reserve two of the four Krait cores for the VLC media player. When the media player
is executed on the application processor, it requires a minimum of two cores to achieve
the desired frame rate, i.e., 30 frames per second (fps).
When web browsing and other compute-oriented workloads are co-scheduled on
the application processor, the performance of web browsing is degraded by 6% to 27%
(Figure 3.2). This performance degradation comes from interference at the memory
subsystem. Some of the performance degradation can be tolerated, while further
performance degradation directly translates to a lowered user experience. In the case
of web browsing, performance degradation under 9% can be tolerated because
the web page loading/rendering can still be achieved under the 3-second mark.³ As
Figure 3.2 shows, the performance of web browsing is degraded beyond the 3-second
mark (9%) in most of the workloads.
³I assume a one-second delay from the network latency.
[Figure 3.2 plot: S-curve; y-axis: Web Browsing Performance Degradation (%), ticks 1 to 1.3; x-axis: Workloads, 0 to 60; annotated with the 2.8-second, 3.0-second, and 3.2-second marks.]
Figure 3.2: Scenario 1b: Performance degradation S-curve for web browsing when it
is co-scheduled with other applications on the application processor.
I perform similar experiments to quantify the degree of memory interference-
caused performance degradation for the media player. Figure 3.3 shows that the
performance of VLC is degraded by up to 21%. In contrast to web browsing, most of
the time the performance of the media player is unaffected by the co-scheduled
workloads. Its performance, however, can be significantly degraded (by 21%) when the
co-scheduled applications are memory-intensive. This is exemplified by the workload
combination under which the media player is executed concurrently with the most
memory-intensive computation-oriented applications, i.e., sensor data analysis (bp)
and medical app (nw). The 21% performance degradation corresponds to a sub-25fps
frame rate, leading to poor video quality and a lowered user experience.
[Figure 3.3 plot: S-curve; y-axis: Media Player Performance Degradation (%), ticks 1 to 1.25; x-axis: Workloads, 0 to 60; annotated with the 30 fps, 28 fps, and 26 fps marks.]
Figure 3.3: Scenario 1b: Performance degradation S-curve for media player when it
is co-scheduled with other applications on the application processor.
Table 3.4: Workload Combinations for Scenario 2a
Use Case | Workload Combination [cpu, gpu]
WL6: Thermal Management | Thermal management/prediction, communication protocol [hotspot, lu]
WL7: Mobile Medical App | Map and navigation, medical app [nn, nw]
WL8: Online Medical App | Communication protocol, medical app [lu, nw]
WL9: Vehicle Routing | Vehicle space optimization, sensor data analysis [mcf, bp]
WL10: Augmented Reality | Image recognition, map and navigation [ferret, nn]
3.3 Scenario 2: Memory Interference in the Shared Communication Fabric and
Main Memory
This section presents the characterization results for workloads that utilize
smartphone accelerators and focuses on memory interference at the shared main
memory.
[Figure 3.4 plot: bar chart; y-axis: % Performance Degradation (Compared to Each Application Running Alone), 0 to 25; x-axis: the two applications of each workload, WL6 through WL10, with a GMEAN bar per workload.]
Figure 3.4: Scenario 2a: Memory interference in the shared communication fabric
and main memory.
3.3.1 Sustained Computation-bound Smartphone Workloads
To quantify the performance impact when the communication fabric and the
memory are shared by sustained computation-oriented co-scheduled workloads, I
design experiments to perform computations on the application processor and the
programmable GPU. I construct five representative workloads for different computa-
tion scenarios—Thermal Management (WL6), Mobile Medical App (WL7), Online
Medical App (WL8), Vehicle Routing (WL9), and Augmented Reality (WL10). I
run one application on the application processor and another application on the
programmable GPU to evaluate the effect of shared memory contention in smartphone
SoCs. Table 3.4 describes the workload combinations in more detail.
The performance degradation results vary across the different workload combinations.
Figure 3.4 shows that the performance degradation of the application running
on the application processor can be significant (21.7% for the map and navigation
application in WL7), while the performance degradation of the application running on
the GPU does not exceed 4.18% (the medical app in WL8). In general, I observe that the
application that runs on the application processor is more sensitive to main
memory contention. This can be attributed to the higher memory traffic requirement
of the application running on the programmable GPU. This agrees with the
observation from prior studies that focus on the shared cache and memory between
a chip-multiprocessor (CMP) and a discrete GPU [30, 34]: without explicit management,
the bandwidth-demanding GPU could significantly degrade the performance
of CMPs while the GPU performance is largely unaffected.
3.3.2 Interactive, Real-time Smartphone Workloads
To quantify the performance degradation experienced by web browsing and the VLC
media player, I design experiments that cause memory interference at the main memory
between the application processor and the programmable accelerators, i.e., the Adreno
GPU and the Hexagon DSP.
Specifically, I run web browsing together with one other sustained computation-
bound application on the application processor and a GPGPU workload on the
Adreno GPU. In addition, I run VLC, fully accelerated on the Hexagon DSP,
together with two other sustained computation-bound applications on the application
processor and a GPGPU workload on the Adreno GPU.⁴
Table 3.5: Workload Combinations for Scenario 2b
Use Case | Workload Combination
WL11: Browsing + music player | Web browsing, audio playback
WL12: Browsing + video conferencing | Web browsing, Skype
Use Case | Workload Combination [DSP, cpu, gpu]
WL13: VLC + augmented reality | Video playback, image recognition/augmented reality, sensor data analysis [VLC, ferret, bp]
WL14: VLC + vehicle routing | Video playback, vehicle space optimization, map/navigation [VLC, mcf, nn]
WL15: VLC + medical app | Video playback, communication protocol, medical app [VLC, lu, nw]
WL16: VLC + Skype | VLC, Skype
Overall, web browsing experiences significant performance degradation caused by
memory interference at the main memory, whereas the VLC media player does not experience
any performance degradation in the workload combinations described in Table 3.5.⁵,⁶
Figure 3.5 compares the performance of the user-centric, interactive smartphone
applications (web browsing and media player) in the co-scheduled workloads. The
web browsing performance is degraded by a significant 34% when users browse web
pages and make Skype calls at the same time. This can be attributed to the intensive
data streaming rate of the co-scheduled Skype call. On the other hand, web browsing
experiences a 2% performance slowdown when it is co-scheduled with music playback.
This performance degradation can be tolerated under the 3-second mark without
affecting user experience. In addition, Figure 3.5 shows that the performance of the
media player is never affected by the co-scheduled applications when it is executed on
the dedicated Hexagon DSP, while the co-scheduled applications
may experience degradation. The unaffected fps performance could be attributed
to the relatively intensive memory access rate of the media player when running on
the dedicated accelerator compared to that of other applications.
⁴The design principle behind the experimental methodology is to maximize the computation
resources while minimizing the likelihood of computation resource contention. For example, when
running an OpenCL application on the programmable GPU, I allocate one of the four Krait cores to
the OpenCL application for executing the host-side serial part of the code. Similarly, when running
the VLC media player on the programmable DSP, I allocate one of the four Krait cores for handling
the control and accelerator management. Then I maximize the utilization of the Krait processor
with additional co-scheduled workloads.
⁵Web browsing utilizes both the application processor and the Adreno GPU; therefore, the
workload combinations are designed to utilize the remaining available accelerators, e.g., LPASS, Hexagon
DSP, and CP.
⁶VLC is executed on the Hexagon DSP in the full acceleration mode, whereas the other
co-scheduled applications are run on the application processor and the Adreno GPU, respectively.
[Figure 3.5 plot: bar chart; y-axis: Performance Degradation (%), ticks 1 to 1.4; x-axis: the applications of each workload, WL11 through WL16; the 3.0-second mark and the 30-fps mark are indicated.]
Figure 3.5: Scenario 2b: Performance degradation of interactive, real-time
smartphone workloads.
For web browsing and other user-interactive workloads, I expect the performance
degradation to worsen as more GPU cores and accelerators are
integrated into the smartphone chip, leading to even more significant contention at the
main memory and resulting in additional performance loss and unpredictability.
3.4 Chapter Summary
Given the abundance of heterogeneous computing units, the shared memory
resource becomes highly contended, leading to QoS violations for feature-rich yet
timing-critical smartphone applications. The results indicate a significant 34% and
32% performance degradation for web browsing and media player, respectively, when
they are co-scheduled with other applications. The characterization results offer
valuable insight into the unique memory subsystem behavior of current
and future smartphone applications, which motivates the development of intelligent memory
interference mitigation techniques that satisfy user performance requirements while
delivering high performance.
Chapter 4
MEMORY INTERFERENCE MITIGATION
We have already observed that the performance of web browsing is degraded
significantly and unpredictably when it is co-scheduled with other processes or applications.
This unpredictability can lead to violations of QoS requirements. We also
observe a high degree of variability in the energy efficiency
(performance per watt, PPW) achieved at different frequencies under varying degrees of memory
interference. Therefore, it is imperative to choose the frequency setting that
maximizes device energy efficiency while ensuring that web pages meet their QoS deadlines.
To predict this optimal frequency setting, a Dynamic thrOttling-based memoRy
interference-Aware technique, called DORA, is explored. DORA makes online
frequency setting predictions using statically-trained timing and device power models
that consider dynamically-varying architecture and system conditions. The timing
model is built for web page load time prediction, while the power model is built for
device power consumption prediction at all processor frequency settings. DORA
computes the estimated web page load time at different frequency settings and
selects the setting that allows web browsing to meet its specified deadline while
achieving maximum energy efficiency for the smartphone.
4.1 Experimental Methodology
In this section, I describe the experimental setup, the web pages and interfering
applications I use to build the web page load time and power models, and to evaluate
DORA.
Table 4.1: Web Page Classification
Complexity | Load Time | Web Pages
Low | < 2 sec | Amazon, Twitter, YouTube, 360, MSN, BBC, CNN, Reddit, Alibaba
High | > 2 sec | IMDB, ESPN, Hao123, Imgur, Aliexpress
4.1.1 Real Device Measurement Infrastructure
I perform experiments on a Google Nexus 5 smartphone [13] housing a Qualcomm
MSM8974 Snapdragon 800 chipset [15]. The chipset has 14 different frequency
settings available, ranging from 300MHz to 2265MHz. Table 3.1 summarizes the device
specifications. The device runs the rooted Android KitKat 4.4 operating system [8].
The kernel is configured to enable performance profiling using perf [14]. I use a National
Instruments Data Acquisition Unit (DAQ) to measure smartphone power
consumption [2].
4.1.2 Web Page Characteristics
I utilize a selection of fourteen popular web pages reported on “Alexa top 500
websites” [1] for this study.¹ The chosen web pages represent a wide variety of
application domains such as online shopping, sports, entertainment, news, and social
media. They also vary widely in their complexity, resulting in load times that range
from hundreds of milliseconds to 4 seconds when they execute alone. The web pages are
classified based on their complexity, and the classifications are shown in Table 4.1.
I utilize the Firefox mobile web browser for the experiments. The source code of the web
pages is instrumented to enable the web page load time measurement. All web pages
are stored in memory, eliminating any non-deterministic network fluctuation.
¹I utilize the top 14 web pages that could load completely on an Android smartphone. I also ignore
search engines, as the page of interest in a search engine would be the search results page and not
the typically simplistic home page of the search engine.
Table 4.2: Interfering Application Classification.
Intensity | L2 MPKI | Interfering Applications
Low | < 1 | Image processing (SRAD, Heart Wall), clustering analysis (Kmeans), temperature management (HotSpot)
Medium | 1 - 7 | Image processing (SRAD2), tree and graph traversal (Breadth-First Search, B+ Tree)
High | > 7 | Sensor data analysis (Back Propagation), bioinformatics (Needleman-Wunsch)
4.1.3 Interfering Application Characteristics
I use a diverse set of algorithms that form the basic building blocks of current and
future smartphone workloads to act as interfering applications. These algorithms
are taken from the Rodinia benchmark suite [22] as they represent a wide variety
of smartphone application domains, such as sensor data analysis, image processing,
thermal prediction and management, video games, and medical applications. I classify
these algorithms based on their memory intensity, and the classification is shown in
Table 4.2. All the interfering applications are cross-compiled using ARM-Android
NDK toolchains [9] and are statically assigned to a specific core. The applications
are pushed to the device and launched using the Android Debug Bridge (adb) [7]
terminal.
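The memory-intensity classification can be sketched as follows, assuming that L2 MPKI is computed from raw L2 miss and retired-instruction counts and that the class boundaries are those of Table 4.2 (below 1, 1 to 7, above 7). The function names and the counter values are hypothetical; how the boundary values 1 and 7 are assigned is also an assumption here.

```python
def l2_mpki(l2_misses: int, instructions: int) -> float:
    """L2 cache misses per kilo (1000) instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return l2_misses / (instructions / 1000.0)


def classify_intensity(mpki: float) -> str:
    """Map an L2 MPKI value to a memory-intensity class."""
    if mpki < 1.0:
        return "low"
    if mpki <= 7.0:   # assumption: both boundary values fall in the medium class
        return "medium"
    return "high"


# Hypothetical counter readings for one interfering application.
mpki = l2_mpki(l2_misses=5_000, instructions=1_000_000)
print(mpki, classify_intensity(mpki))
```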
4.1.4 Multiprogrammed Workloads
The experiments are conducted to mimic a multiprogrammed execution scenario
on a modern smartphone. Specifically, I execute the Firefox browser on two cores
and an interfering application on the third core of a quad-core processor.² The
workloads are created by combining each web page (Table 4.1) with one interfering
application from each memory intensity category shown in Table 4.2. This results
in 42 workload combinations in total, i.e., 14 web pages, each co-scheduled with an
interfering application from the low, medium, and high intensity categories. The 42
workloads created are shown in Table B.1.
²The fourth core was kept switched off for all the experiments.
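The count of 42 follows from pairing each of the 14 web pages with one interfering application per intensity category. A minimal sketch of that enumeration is given below; it pairs pages with category labels only, because the specific application chosen for each pairing is listed in Table B.1 and is not reproduced here.

```python
from itertools import product

# Web pages as classified in Table 4.1.
LOW_COMPLEXITY_PAGES = ["Amazon", "Twitter", "YouTube", "360", "MSN",
                        "BBC", "CNN", "Reddit", "Alibaba"]
HIGH_COMPLEXITY_PAGES = ["IMDB", "ESPN", "Hao123", "Imgur", "Aliexpress"]

# Memory-intensity categories of the interfering applications (Table 4.2).
INTENSITY_CATEGORIES = ["low", "medium", "high"]


def build_workloads():
    """Pair every web page with one interferer from each intensity category."""
    pages = LOW_COMPLEXITY_PAGES + HIGH_COMPLEXITY_PAGES
    return list(product(pages, INTENSITY_CATEGORIES))


workloads = build_workloads()
print(len(workloads))  # 14 pages x 3 categories
```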
4.1.5 Model Parameters and Configuration
I acquire over 300 instances of power and web page load time measurements by
executing multiple web page workload combinations at different frequency settings,
using the setup described in Section 4.1.1. I use these observations to estimate the
coefficients of the power and web page load time models using least squares.
I evaluate three kinds of regression models: simple linear regression, linear regression
that accounts for interaction between independent variables (interactions model),
and a quadratic model. I carry out the model development and analysis using
MATLAB [12].
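To illustrate the coefficient estimation step, the sketch below fits the simple linear model by least squares using the normal equations. It is a pure-Python stand-in for illustration, not the MATLAB workflow used in this work; the helper names and the synthetic observations are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: observations do not determine the model")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(rows, targets):
    """Fit y = c0 + sum(ci * xi) by minimizing the squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 models the intercept c0
    k = len(design[0])
    ata = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    aty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(ata, aty)


def predict(coeffs, row):
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


# Synthetic observations generated from y = 1 + 2*x1 + 3*x2 (illustration only).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1.0, 3.0, 4.0, 6.0, 8.0]
coeffs = fit_least_squares(rows, targets)
print([round(c, 6) for c in coeffs])
```

Solving the normal equations directly is adequate for a handful of well-scaled variables; a library routine based on QR or SVD would be the more robust choice for ill-conditioned data.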
4.2 Design
In this section I present the design of the algorithm to dynamically set the
PPW-optimal frequency subject to a given deadline, for a smartphone running a
web browser. The algorithm is referred to as DORA, which stands for Dynamic
thrOttling-based memoRy interference Aware technique. The objective of DORA is
to provide a high-quality web browsing experience for smartphone users while maxi-
mizing the battery lifetime.
The first design challenge is to accurately predict the range of core frequencies,
i.e., f_i, ..., f_n, that ensure the web pages complete rendering within the given deadline.
The web page load time prediction has to consider the interference introduced by
background processes and co-scheduled workloads. The second design challenge is to
accurately identify a core frequency (f_opt) within f_i, ..., f_n such that the energy efficiency
of the smartphone device is maximized.
[Figure 4.1 diagram: at a 100 msec sampling interval, the DORA frequency control block reads CPU utilization, L2 MPKI, and CPU core temperature from the quad-core processor (Core0 to Core3, private L1 caches, shared L2 cache, main memory), together with the web page complexity. The performance model gives web page load time (T_Load) versus frequency against the deadline, the power model gives power consumption (P_dyn + P_lkg) versus frequency, and the PPW energy efficiency over the feasible frequency range yields the optimal frequency, f_opt.]
Figure 4.1: DORA overview. DORA periodically monitors the core utilization, core
temperature and LLC MPKI to select the most energy efficient frequency setting that
allows the web page to meet its QoS target load time.
DORA tackles the first design challenge by using a web page load time prediction
model that takes into account the complexity of web pages, the degree of memory
interference introduced by background processes and co-scheduled workloads, and
the core operating frequencies (Section 4.3.1). To select f_opt, DORA utilizes a power
model that estimates the device power consumption for all core operating frequencies
(Section 4.3.2). With the web page load time and device power models, DORA calcu-
lates the performance-per-watt (PPW) energy efficiency values at all core operating
frequencies and sets the processor frequency to result in the highest energy efficiency
for the device.
To take into account the dynamic nature of interference by co-scheduled applications,
DORA performs the aforementioned steps at regular intervals. It monitors the
intensity of memory interference, determines f_opt for the current time period, and
adjusts the core operating frequency to f_opt in the next period. Figure 4.1 shows this
iterative process, in which DORA executes periodically in the background with minimal
impact on user experience. The outline of DORA, in the form of pseudo-code, is
shown in Algorithm 1.
Algorithm 1: DORA pseudo-code.
function DORA(QoS_target, web_page_complexity, Core_Utilization, Core_Temp, L2_MPKI)
    ▷ select an energy-efficient, QoS-aware frequency setting
    max_PPW ← 0
    optimal_freq ← 0
    for F in AllFrequencies do
        pred_time ← PredictLoadTime(F)
        if pred_time <= QoS_target then    ▷ QoS_target is met at this frequency
            pred_power ← PredictTotalPower(F)
            pred_PPW ← 1 / (pred_time ∗ pred_power)
            if pred_PPW > max_PPW then
                max_PPW ← pred_PPW
                optimal_freq ← F
            end
        end
    end
    SetCoreFrequency(optimal_freq)
end function
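The selection loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in this work: the two predictors are passed in as callables, the toy models and their coefficients are hypothetical stand-ins for the trained regression models, and the candidate list is a hypothetical subset of the available frequency settings. Where Algorithm 1 leaves optimal_freq at 0 when no frequency meets the deadline, the sketch returns None and leaves the fallback policy to the caller.

```python
from typing import Callable, Iterable, Optional


def dora_select_frequency(
    frequencies_mhz: Iterable[int],
    qos_target_s: float,
    predict_load_time: Callable[[int], float],
    predict_total_power: Callable[[int], float],
) -> Optional[int]:
    """Return the deadline-meeting frequency with the highest predicted PPW.

    Returns None when no candidate frequency meets the deadline.
    """
    best_freq: Optional[int] = None
    best_ppw = 0.0
    for freq in frequencies_mhz:
        pred_time = predict_load_time(freq)
        if pred_time > qos_target_s:
            continue  # the QoS target would be missed at this frequency
        pred_power = predict_total_power(freq)
        pred_ppw = 1.0 / (pred_time * pred_power)  # performance per watt
        if pred_ppw > best_ppw:
            best_ppw, best_freq = pred_ppw, freq
    return best_freq


# Toy stand-ins for the trained models (hypothetical coefficients).
def toy_load_time(freq_mhz: int) -> float:
    return 1800.0 / freq_mhz + 0.5          # seconds


def toy_total_power(freq_mhz: int) -> float:
    return 0.5 + (freq_mhz / 1000.0) ** 3   # watts


candidates = [300, 960, 1574, 2265]  # MHz; hypothetical subset of settings
print(dora_select_frequency(candidates, 2.0, toy_load_time, toy_total_power))
```

With these toy models, a tighter deadline excludes the lower frequencies, and among the remaining candidates the lowest feasible frequency tends to win because predicted power grows faster than load time shrinks.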
4.3 Power and Performance Models
I construct the web page load time and device power models using regression. A
regression model is a hypothesized parametric relationship between a response, or
dependent, variable y (power or performance) and a set of independent variables X_i,
i ∈ {1, 2, 3, ..., n}. I use polynomial regression, where the unknown model parameters
are the polynomial coefficients, which are estimated by minimizing the mean-square
error between a set of observed values and model-predicted values.
4.3.1 Model-based Web Page Load Time Prediction
I evaluated the correlation of different HTML elements of web pages with the cor-
responding load times in the presence of interference. I observed that five important
parameters of web pages best represent web page complexity and hence the impact on
web page load time — the number of DOM Tree nodes, class and href attributes,
a and div tags. Therefore, I include the aforementioned parameters in the web page
load time prediction model. Zhu et al. [42] identified a similar set of parameters to
predict the web page load time in the absence of interference.
Next, I identified the runtime architectural parameters that influence the web
page load time. In order to account for the influence of memory interference on the
performance of web browsing, the degree of interference in the shared memory was
considered, i.e., the access rate in the shared last-level L2 cache and in the shared
DRAM. Furthermore, I observed that the memory intensity, as represented by the
number of L2 cache misses per kilo instructions (MPKI), has a large influence on the
web page load time.
Finally, the core operating frequencies and the memory bus frequencies also have
a pronounced impact on the web page load time. It should be noted that as the core
frequency increases, the memory bus frequency increases non-linearly on the Google
Nexus 5 platform [13]. Specifically, on a Google Nexus 5 platform, a set of core
frequencies map to a particular memory bus frequency. Therefore, I build piecewise
models for each set of core frequencies that share a single memory bus frequency in
order to take the impact of memory bus frequency into account. With the above
insights, I evaluate three different regression models. These are the linear, interaction,
and quadratic models, shown in Equations (4.1)–(4.3), respectively:
\[ L = c_0 + \sum_{i=1}^{N} c_i X_i \tag{4.1} \]
\[ L = c_0 + \sum_{i=1}^{N} c_i X_i + \sum_{i,j \in (1 \ldots N),\, i \neq j} c_{i,j} X_i X_j \tag{4.2} \]
\[ L = c_0 + \sum_{i=1}^{N} c_i X_i + \sum_{i,j \in (1 \ldots N)} c_{i,j} X_i X_j \tag{4.3} \]
where L is the web page load time, Xi represents the independent variables,
and the ci’s and ci,j’s are the coefficient parameters that must be determined. The
variables are detailed in Table 4.3.
Table 4.3: List of Independent Variables
Variable   Description
X1         Number of DOM tree nodes
X2         Number of class attributes
X3         Number of href attributes
X4         Number of "a" tags
X5         Number of "div" tags
X6         LLC MPKI (shared L2 cache misses per kilo instructions)
X7         Core frequency
X8         Number of attributes (only used for the power model)
X9         CPU utilization of co-scheduled task (only used for the power model)
I observe that a non-linear model (one that includes products of independent variables) is
necessary to capture the covariance among the control variables and the web page load
time.
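To make the difference between the three model forms concrete, the following Python sketch expands a vector of independent variables into the terms each model uses. It assumes the usual reading of these models: the interaction model adds products of distinct variables to the linear terms, and the quadratic model additionally adds squared terms. Each unordered pair is listed once, which is equivalent to folding c_{i,j} and c_{j,i} into a single coefficient. The names `expand_features` and `predict` are hypothetical helpers, not part of DORA.

```python
from itertools import combinations
from typing import List, Sequence

MODELS = ("linear", "interaction", "quadratic")


def expand_features(x: Sequence[float], model: str) -> List[float]:
    """Return the regression terms for one sample under the given model."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    terms = [1.0] + [float(v) for v in x]  # intercept and linear terms
    if model in ("interaction", "quadratic"):
        # products of distinct variables X_i * X_j, each pair once
        terms += [a * b for a, b in combinations(x, 2)]
    if model == "quadratic":
        terms += [v * v for v in x]  # squared terms X_i^2
    return terms


def predict(coeffs: Sequence[float], x: Sequence[float], model: str) -> float:
    """Evaluate the model as the dot product of coefficients and terms."""
    feats = expand_features(x, model)
    if len(coeffs) != len(feats):
        raise ValueError("coefficient count does not match the model")
    return sum(c * f for c, f in zip(coeffs, feats))
```

For N independent variables this gives N + 1 terms for the linear model, N(N − 1)/2 more for the interaction model, and a further N for the quadratic model, which is one way to see why the richer models cost more to evaluate.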
4.3.2 Dynamic and Leakage Power Prediction
DORA’s power model includes the dynamic power, Pdyn, and the leakage power, Plkg.
Pdyn depends on the compute and memory resource utilization of the cores and the
corresponding processor voltage and frequency settings, and Plkg depends on the
operating voltage and temperature.
Dynamic Power Model: Similar to the web page load time model, web page
complexity is identified to be a good predictor for the dynamic power consumption
of web pages. In addition, the degree of memory interference is also a key factor
that significantly contributes to the smartphone power consumption. As interference
at the L2 cache increases, additional data movement is required to fetch data into
the L2 cache upon demand. This incurs additional power and energy cost. Thus,
the parameter of LLC MPKI is included in the dynamic power model. To consider
the dynamic power consumption contribution from the cores running background
processes or co-scheduled applications, core utilization is used. Core utilization has
been shown to have a linear relationship with core dynamic power consumption for
general compute-bound workloads [19]. Finally, the core operating frequency has a
direct effect on the dynamic power consumption. With these insights, I once again
utilize the same three response surface models as in (4.1)–(4.3); however, this time
there is an additional independent variable, i.e., the co-scheduled task’s CPU utilization,
and the dependent variable is now the total device dynamic power.
I observe that a simple linear regression model is sufficient to capture the rela-
tionship between the independent variables and smartphone power consumption.
Leakage Power Model: Due to the lack of cooling elements in most smartphones,
high thermal levels often contribute to a significant portion of the total device power
budget in the form of leakage power. Therefore, when computing the total device
power consumption, we must include the effect of leakage power consumption such
that the energy-efficient setting prediction for fopt considers this important factor
representing realistic usage scenarios. I utilize the empirical model of [31] to capture
the non-linear power dependence on temperature and voltage as exhibited by CMOS-
based technologies:
\[ P_{lkg} = k_1 v T^2 e^{\frac{\alpha v + \beta}{T}} + k_2 e^{(\gamma v + \delta)} \tag{4.4} \]
where k1, k2, α, β, γ, and δ are parameters that depend on circuit topology, and
v and T are the operational voltage and temperature of the SoC respectively. The
parameters of the leakage power model are determined using non-linear numerical
solutions and mean square error minimization. By using the per-core thermal sensors,
I construct a leakage model for each computational unit to increase the accuracy of
the leakage model.
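As a sketch only, Equation (4.4) translates directly into a small Python function. The function name and argument order are hypothetical, and the fitted values of k1, k2, α, β, γ, and δ are not given in this chapter, so they appear here as plain parameters. The temperature is assumed to be expressed on an absolute scale (for example, Kelvin) so that the division by T is well defined.

```python
import math


def leakage_power(v: float, temp: float, k1: float, k2: float,
                  alpha: float, beta: float, gamma: float, delta: float) -> float:
    """Evaluate the empirical leakage model of Equation (4.4).

    v is the operating voltage and temp the temperature (assumed absolute,
    e.g. Kelvin); the remaining arguments are the fitted model parameters.
    """
    if temp <= 0:
        raise ValueError("temperature must be positive")
    thermal_term = k1 * v * temp ** 2 * math.exp((alpha * v + beta) / temp)
    voltage_term = k2 * math.exp(gamma * v + delta)
    return thermal_term + voltage_term
```

With a negative β, as is typical for this kind of model, the first term grows with temperature, which is the behavior that makes leakage matter at high operating frequencies.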
Since different frequencies cause different switching activity and power
dissipation, they result in different temperature increases as well. In order to model this
temperature rise, I observe the temperature rise between the current and previous
time instants, ∆T(n,n−1). Since this temperature rise is caused by the power
dissipation that occurred during the previous sampling period, Ptotaln−1, I use simple linear
scaling to predict the temperature increase at different frequencies as follows:
\[ T_{n+1}^{f} = T_n + \frac{\Delta T_{(n,n-1)} \cdot P_{dyn}(f,n)}{P_{total_{n-1}}} \tag{4.5} \]
where $T_{n+1}^{f}$ is the predicted temperature if we were to choose the frequency f, $T_n$
is the current temperature, $\Delta T_{(n,n-1)}$ is the change in temperature from the previous
time instant, $P_{dyn}(f,n)$ is the predicted dynamic power at this time instant for the
frequency f, and $P_{total_{n-1}}$ is the predicted total power at the previous time instant.
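The linear scaling in Equation (4.5) amounts to a one-line computation. The following Python sketch uses hypothetical argument names, and the numbers in the usage line are made up purely to show the arithmetic.

```python
def predict_temperature(t_now: float, delta_t_prev: float,
                        p_dyn_pred: float, p_total_prev: float) -> float:
    """Predict the next-period temperature for a candidate frequency.

    The temperature rise observed over the previous period is scaled by the
    ratio of the predicted dynamic power at the candidate frequency to the
    total power of the previous period, as in Equation (4.5).
    """
    if p_total_prev <= 0:
        raise ValueError("previous total power must be positive")
    return t_now + delta_t_prev * p_dyn_pred / p_total_prev


# Hypothetical values: 58 degrees now, a 2-degree rise last period at 2.0 W,
# and a candidate frequency predicted to draw 3.0 W of dynamic power.
t_next = predict_temperature(58.0, 2.0, 3.0, 2.0)
```

The predicted temperature can then be fed into the leakage model for the same candidate frequency, so that each frequency is evaluated with its own expected temperature.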
Table 4.4: Performance and Power Model Accuracy
Performance Model    Accuracy (%)   Complexity (ns)
Linear Model         77.39          143.01
Interaction Model    97.73          844.87
Quadratic Model      97.81          1076.22

Power Model          Accuracy (%)   Complexity (ns)
Linear Model         96.59          102.15
Interaction Model    97.00          393.6
Quadratic Model      97.02          558.85
4.4 Performance and Power Model Accuracy
In this section I examine the accuracy of the performance and power models. As
mentioned in Section 4.3, I consider three regression models: simple
linear regression, linear regression that accounts for interactions between independent
variables (the interaction model), and a quadratic model. Table 4.4 shows the accuracy
and complexity³ of the three models considered for web page load time and power
prediction.
4.4.1 Performance Model Accuracy
The average error rate for the web page load time model is 22.61%, 2.27%, and
2.19% for the linear, interaction, and quadratic models, respectively. Figure 4.2 shows the
cumulative distribution of prediction errors in the load time model. Each point (x, y)
shows the fraction of web pages (y) whose error is lower than or equal to x. From
Figure 4.2 we can see that for the interaction model about 92.5% of the web pages
have less than 5% error. The linear model has a very high error rate: only 12% of
web pages have less than 5% error. The quadratic model, on the other hand, has
93.4% of the web pages under 5% error. Comparing the complexity overhead
of the quadratic model with that of the interaction model, we can observe that an
³ Complexity is calculated in nanoseconds as the time taken to compute the load time or power
by the models on the basis of the latency of algorithmic operations.
[Figure 4.2 plot. x-axis: Error %; y-axis: Fraction of webpages; series: Linear, Interaction, Quadratic.]
Figure 4.2: The cumulative distribution of prediction errors for performance models.
interaction model is sufficient to capture the covariance among the control variables
and the web page load time. The interaction performance model has a maximum error
of 11%. It is important to note that the models are evaluated on the same set of web pages
used to create them, but with a different set of interfering applications. This was
primarily done to focus on the impact of memory interference. Additionally, I
look at how these models perform on a new set of web pages (Alipay, Ebay, Firefox,
Instagram). For this new set, the average error rate for the web page load time model is 28.92%,
14.88%, and 22.37% for the linear, interaction, and quadratic models, respectively.
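The cumulative error distribution discussed above can be computed from a list of per-page prediction errors with a few lines of Python. This is a minimal sketch with hypothetical function names; the error values in the usage line are invented for illustration and are not measurements from this work.

```python
from typing import List, Sequence, Tuple


def fraction_within(errors: Sequence[float], threshold: float) -> float:
    """Fraction of web pages whose error is lower than or equal to threshold."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(1 for e in errors if e <= threshold) / len(errors)


def empirical_cdf(errors: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, y) points: y is the fraction of pages with error <= x."""
    ordered = sorted(errors)
    n = len(ordered)
    return [(e, (i + 1) / n) for i, e in enumerate(ordered)]


# Hypothetical per-page errors in percent.
sample_errors = [1.0, 2.0, 3.0, 4.0, 10.0]
share_under_5 = fraction_within(sample_errors, 5.0)
```

Reading off `fraction_within(errors, 5.0)` for each model corresponds to the "less than 5% error" figures quoted in the text.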
4.4.2 Power Model Accuracy
Figure 4.3 shows the cumulative distribution of prediction errors in the power
model. The average error rate for the power model is 3.41%, 3.00%, and 2.97% for the
linear, interaction, and quadratic models, respectively. Figure 4.3 shows that for 77.67%
of web pages, the linear model has less than 5% error, and for 94.1% of web pages, the
[Figure 4.3 plot. x-axis: Error %; y-axis: Fraction of webpages; series: Linear, Interaction, Quadratic.]
Figure 4.3: The cumulative distribution of prediction errors for power models.
linear model has less than 10% error. The interaction and quadratic models are
similar to the linear model in accuracy, but are roughly 4 to 5.5 times more complex (Table 4.4).
A simple linear regression model is therefore sufficient to predict the power consumption of
the web page load process in the presence of interference. The linear power model has
a maximum error of 11.5%. As with the web page load time models, I evaluate the
accuracy of the dynamic power model on the new set of web pages. I observe error rates of
4.07%, 5.14%, and 6.03% for the linear, interaction, and quadratic models, respectively, for the
new set of web pages.
4.5 Evaluation Results and Analysis
4.5.1 Performance and Energy Efficiency Trends
DORA is compared with the existing Android frequency governors interactive
and performance.⁴ The performance governor always operates the cores at the
highest available frequency of 2.2GHz. The interactive governor, on the other hand,
monitors the processor utilization and chooses a high frequency when the utilization
is high and a lower frequency when the utilization is low. I use interactive as the
baseline for my studies, as it is the default option on most smartphones today.
⁴ I do not consider powersave as an effective governor, as it results in unreasonably long load
times (7 - 26 seconds) for all workloads while also being extremely energy inefficient.
Figure 4.4 shows the average energy efficiency improvement and the distribution
of web page load time achieved by DORA and the other governors. On average, frequency
selection based on DORA’s predictions results in an 18% improvement in energy efficiency compared
to the baseline interactive governor. Frequency settings based on DORA’s web page
load time predictions meet the QoS target of a 3-second deadline whenever
possible with the available frequency settings on the SoC. When the smartphone
approaches 100% utilization, the baseline interactive and performance governors
perform nearly identically: they both result in the same energy efficiency and
load time. Although performance and interactive generally achieve faster web
page load times than DORA (Figure 4.4(b)), this comes at the cost of lower energy
efficiency (Figure 4.4(a)). I also assess how closely DORA performs to a static offline
optimal configuration, Offlineopt,⁵ and find that, among the workloads evaluated,
DORA achieves the same energy efficiency improvement as Offlineopt, improving
the smartphone energy efficiency by an average of 20%.
4.5.2 The Adaptive Nature of DORA
DORA is designed to dynamically predict the web page load time and energy
efficiency in order to determine fopt, given by Equation 4.6. fE is the frequency that
maximizes the PPW. It is the unconstrained (i.e., chosen without regard to a deadline) frequency that
⁵ Offlineopt represents the single frequency setting that maximizes the energy efficiency achieved
while loading the web page within the 3-second deadline. I obtain the PPW results for Offlineopt by
enumerating all possible frequency settings for ten randomly chosen workloads from the workloads
constructed in this work. The time taken to generate the PPW results for all possible frequency
settings for all available workloads is prohibitively high.
[Figure 4.4 plots. (a) x-axis: Governor; y-axis: Energy Efficiency (PPW) Normalized to interactive. (b) x-axis: Load Time (Seconds); y-axis: Fraction of Webpages; series: Interactive, Performance, DORA.]
Figure 4.4: Average energy efficiency and web page load time comparison of DORA
with other governors across all evaluated workloads.
results in maximized battery life. On the other hand, fD is the lowest frequency
setting that allows the web page to meet its QoS target performance. DORA chooses
fopt by estimating fE, the most energy-efficient frequency setting, and fD, the
lowest frequency that meets the deadline. Figure 4.5 shows the improvement in
energy efficiency achieved by the different governors for all evaluated workloads.
The distinctive behavior of DORA is shown by workloads WL13 and beyond. For these
workloads, DORA chooses the energy-efficient frequency setting fE, which
results in an average improvement in energy efficiency of 24% compared to the
interactive governor. It is important to note that for these workloads, DORA’s
predictions result in meeting the 3-second load time deadline while achieving the
[Figure 4.5 plot. x-axis: Workload Number; y-axis: Energy Efficiency (PPW) Normalized to interactive; series: Interactive, Performance, DORA; regions: fD > fE, fE > fD.]
Figure 4.5: Energy efficiency comparison of DORA with other governors for all evalu-
ated workloads. Each point on the x-axis represents a different workload whose energy
efficiency improvement for different governors is plotted along the y-axis. The work-
loads are sorted in the order of energy efficiency improvement achieved by DORA.
maximum possible energy efficiency. These workloads are classified under “fE > fD”
category.
\[ f_{opt} = \begin{cases} f_E & f_D \le f_E \\ f_D & f_D > f_E \end{cases} \tag{4.6} \]
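Equation (4.6) can be read as a two-way choice between fE and fD. The Python sketch below shows that choice together with one way to obtain fD from a load time predictor. The helper names and the toy load time model in the usage lines are hypothetical, and the fallback to the highest frequency when no setting meets the deadline is an assumption based on the behavior described in this section.

```python
from typing import Callable, Sequence


def lowest_deadline_frequency(
    frequencies: Sequence[float],
    predict_load_time: Callable[[float], float],
    qos_target: float,
) -> float:
    """Return f_D: the lowest frequency whose predicted load time meets the
    QoS target. Assumption: fall back to the highest frequency if none does."""
    feasible = [f for f in frequencies if predict_load_time(f) <= qos_target]
    return min(feasible) if feasible else max(frequencies)


def choose_fopt(f_e: float, f_d: float) -> float:
    """Apply Equation (4.6): use f_E when it is at least f_D, otherwise f_D."""
    return f_e if f_d <= f_e else f_d


# Hypothetical frequencies (GHz) and toy load time model (seconds).
FREQS = [0.96, 1.19, 1.49, 1.72, 1.9, 2.2]
f_d = lowest_deadline_frequency(FREQS, lambda f: 4.0 / f, qos_target=3.0)
f_opt = choose_fopt(f_e=1.19, f_d=f_d)
```

In this toy case fE lies below fD, so the deadline constraint decides the outcome and the sketch returns fD.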
Workloads WL1 through WL12 are classified under the “fE < fD” category. In such
scenarios, choosing fE violates the 3-second web page load time deadline. Since fE
is chosen purely for energy efficiency, it ignores the QoS target. Choosing fE therefore results
in large QoS violations, with web page load times as high as 6 seconds.
For workloads where choosing fE would miss the QoS
target, DORA correctly identifies the frequency range that allows the web page to
meet its QoS target, and then selects the most energy efficient frequency setting
[Figure 4.6 plot. x-axis: Workload Number; y-axis: Energy Efficiency (PPW) Normalized to interactive; series: Interactive, Performance, DORA; regions: fD > fE, fE > fD.]
Figure 4.6: Energy efficiency comparison of DORA with other governors.
within that range. This frequency often coincides with the frequency fD for these
workloads. As a result, DORA meets the QoS deadline, or comes close to it,
for these workloads. Generally, DORA satisfies the 3-second deadline
as long as the deadline is met by the performance governor. For workloads where the
web pages cannot load within the 3-second deadline even at the highest frequency setting,
DORA prioritizes QoS and chooses the highest frequency setting to ensure that
the web pages are loaded as fast as possible. This resiliency to variations in workloads
underscores the fact that an optimal frequency governor for today’s smartphones, such
as DORA, needs to adapt its frequency decision while considering energy efficiency
and performance in the presence of run-time factors like memory interference.
It is important to note that DORA was evaluated on the same set of web pages
used to create the model, but with a different set of interfering applications. This was
primarily done to focus on the impact of memory interference. Additionally, the
[Figure 4.7 plots. (a) y-axis: PPW Normalized to Interactive. (b) y-axis: Frequency selection Breakdown; series: 0.96GHz, 1.49GHz, 1.72GHz.]
Figure 4.7: Interaction of DORA with varying co-scheduled application memory in-
tensity. (a) Energy efficiency comparison for DORA. (b) DORA frequency selection
for low, medium and high memory intensity co-scheduled application.
top 50 most popular web pages form a relatively stable set [1]; thus, DORA is trained
around the most common web browsing scenarios. I also evaluate
DORA on 4 additional web pages (Ebay, Alipay, Instagram, Firefox) with 3
different applications that are not used to train the performance and power models,
and I observe similar energy efficiency improvement over the Android governors.
The workloads are listed in Table B.2. Figure 4.6 shows the energy efficiency
improvement achieved by DORA over interactive for the 12 new combinations. On
average, DORA improves energy efficiency by 10% over the interactive
governor. The trend is similar to that in Figure 4.5: up to WL5, DORA chooses fD,
and beyond that DORA chooses fE.
4.5.3 Interaction of DORA with Memory Interference Intensity
The memory intensity of the co-scheduled applications varies with time, and
DORA needs to account for that while choosing the optimal frequency (fopt). To
highlight the behavior of DORA under such varied conditions, I take a closer look at the
fopt chosen by DORA when the web page Msn is co-scheduled with a low, a medium,
or a high intensity application. Figure 4.7(a) shows the improvement in PPW
compared to the interactive governor, and Figure 4.7(b) shows the frequencies chosen
by DORA. Because web pages co-scheduled with low memory intensity
applications suffer less degradation than those co-scheduled with high
intensity applications, the optimal operating frequency changes, and so does the
PPW improvement. For example, Figure 4.7(b) shows that as the memory intensity
changes from medium to high, DORA chooses 1.72GHz as fopt for 10% more of the time.
This can be attributed to the increased memory interference.
4.5.4 Impact of Leakage Power
An important feature of DORA is its consideration of temperature, which influences
the selection of fopt. Prior work, such as [42, 45], does not consider the leakage
power component when optimizing for energy efficiency. This is likely to lead to a
sub-optimal frequency setting. In order to evaluate the impact of leakage power on
energy efficiency, I compare DORA with a configuration that does not take leakage
power into account, DORA_no_lkg. That is, DORA_no_lkg makes the frequency selection
decision using the dynamic power consumption component only. Figure 4.8(a) shows
the energy efficiency of DORA and DORA_no_lkg for the web page Amazon when it
is co-scheduled with a medium memory intensity application. I observe that DORA
[Figure 4.8 plots. (a) y-axis: Energy Efficiency (PPW) normalized to interactive. (b) x-axis: Core Frequency (GHz); y-axis: Power Consumption (W); series: Low Ambient Temperature, Room Temperature; fopt marked for each series.]
Figure 4.8: Impact of leakage power. (a) Energy efficiency comparison for DORA
and DORA without taking leakage power into account. (b) Power consumption at
different frequencies under two ambient conditions — room temperature and low
ambient temperature.
is able to achieve 10% higher energy efficiency compared to the configuration that
considers only dynamic power consumption.
To delve deeper, I explore the impact of temperature (and consequently leakage
power) on power consumption by evaluating the variation of power consumption at
different frequencies, and fopt under room temperature and low ambient temperature
operation (Figure 4.8(b)). From Figure 4.8(b) we can see that there is a significant
increase in the power consumption at higher frequencies in regular operating con-
ditions, compared to a low ambient temperature condition. This increase in power
consumption can be attributed to the additional leakage power due to the increase in
device temperature. I also observe that the maximum device temperature increases
[Figure 4.9 plot. x-axis: QoS Target (seconds); y-axis: CPU Core Frequency (GHz).]
Figure 4.9: DORA’s frequency selection for different QoS targets
from 58 to 65 degrees Celsius when operating at 1.9 GHz under ambient room
temperature. This temperature rise is significantly higher at higher frequencies than at lower
frequencies, making leakage power a significant contributor to device power at high
frequencies. This increase in temperature, and consequently in leakage power, results in
the optimal operating point, fopt, shifting from 1.9 GHz to 1.7 GHz for this
workload. DORA is able to predict this significant additional leakage power using
Equation 4.4 and identify fopt accurately, resulting in increased energy efficiency.
4.5.5 Interaction of DORA with varying QoS
DORA is designed to dynamically predict the web page load time and energy
efficiency in order to determine fopt. As already seen, DORA chooses either
fE, the most energy-efficient frequency setting, or fD, the lowest frequency that
meets the deadline. fE and fD can change with varying QoS, and DORA can
easily adapt to different QoS targets. In this section, I look at the performance of DORA for
different QoS targets. To highlight the behavior of DORA for different QoS targets, I
look at the fopt chosen by DORA when the web page Hao123 is co-scheduled with a high intensity
application. Figure 4.9 shows the fopt chosen by DORA for different deadlines. We
can see that for QoS targets below 3 seconds DORA chooses the highest frequency
(i.e., 2.2GHz) in order to try to meet the targeted deadline. In these cases, the QoS
target is actually infeasible given the frequency range of the processor and
the system conditions (e.g., the co-scheduled task); however, DORA attempts to come as
close to the QoS target as possible in these situations. When the target QoS is set
between 3 and 6 seconds, we observe a trade-off between QoS and power (which, with all
other terms held constant, increases monotonically with frequency). This is due to fopt
being limited by the QoS target constraint. Once the deadline increases further, fopt reaches
a value of 1.19GHz, which is equivalent to the scenario in which no constraint is given.
4.5.6 Overhead
From the implementation standpoint, DORA includes three key operations, namely,
(1) periodically assessing hardware performance counters, (2) computing the optimal
frequency point, fopt, and (3) switching the core frequency to fopt if the newly com-
puted fopt is different from the current setting.
DORA is a lightweight controller. The time and power overhead of the
first two tasks mentioned above is less than 1%. This low overhead stems from the
fact that the first two steps are non-intrusive to the web page load process and occur in
the background. Although most prior work assumes a relatively small overhead for the
frequency scaling operation, I observe that this overhead is slightly
higher than that of the first two tasks (a maximum of 3% for my evaluated workloads).
DORA monitors the variation in the runtime system performance conditions and
decides to change the frequency setting only when the system performance conditions
have changed significantly enough to alter fopt. The overhead of DORA therefore
depends on the number of times the frequency is scaled during the web page load
process. This overhead is negligible for workloads that exhibit relatively stable
phase behavior during the web page load process. For other workloads, where DORA
scales the processor frequency often, the energy efficiency improvement brought by
DORA is high; thus, the overhead associated with the needed frequency scaling is
considered to be worthwhile.
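The dependence of the scaling overhead on how often fopt changes can be illustrated with a short Python sketch. It counts how many frequency switches a sequence of per-period fopt decisions would trigger if the core frequency is changed only when the new decision differs from the current setting. The function name and the decision sequence in the usage line are hypothetical.

```python
from typing import Iterable


def count_frequency_switches(decisions: Iterable[float], initial_freq: float) -> int:
    """Count the switches needed to follow a sequence of f_opt decisions,
    switching only when the decision differs from the current frequency."""
    current = initial_freq
    switches = 0
    for f_opt in decisions:
        if f_opt != current:  # step (3): switch only on a changed f_opt
            current = f_opt
            switches += 1
    return switches


# Hypothetical decisions over five periods (GHz), starting at 1.49 GHz.
n_switches = count_frequency_switches([1.49, 1.49, 1.72, 1.72, 1.49], 1.49)
```

A workload with stable phase behavior yields long runs of identical decisions and therefore few switches, which matches the observation that its overhead is negligible.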
4.6 Comparison with Prior Work
The study of smartphone web browser performance has been the subject of many
recent works as it is one of the most widely used applications on mobile devices.
Web browser performance optimization can be achieved through enhancements at
multiple levels of the hardware-software stack. Many of the early works on browser
optimization focused on improving browser specific tasks through software techniques
such as task parallelization, browser rendering, and smarter browser caching, such
as [17, 20, 32, 35, 41].
Butkiewicz et al. [21] related web page complexity to important web page
primitives and characterized the impact of web page complexity on web page
performance. Their correlation- and regression-based analysis shows the impact of the number
of objects and servers on the web page load time and its variation; however, they do not
analyze the factors affecting web page energy. Thiagarajan et al. [39] presented
a detailed breakdown of web page energy consumption based on the different web
page primitives and proposed a set of web page design recommendations to minimize
energy consumption. But these works did not consider energy consumption.
Prior works have suggested that micro-architectural techniques such as branch pre-
dictors, advanced cache and prefetcher management techniques can improve browser
performance and, therefore, reduce its energy consumption significantly [28, 36]. Zhu
et al. proposed hardware specializations to improve the performance and energy
efficiency of mobile web browsing [43]. Another recent work by Fan et al. [25] demon-
strated improved browser efficiency with asymmetric multiprocessors that share their
cache hierarchies.
These aforementioned software and micro-architecture level works are orthogonal
to DORA, which performs energy efficiency optimization at the system level.
Therefore, I expect the performance and energy efficiency gains from DORA to be additive.
Many other system level designs aimed at improving the energy efficiency and
QoS of mobile browsers have been reported. Lo et al. [33] considered the response
time and the limits of human perception together to find opportunities to throttle
frequencies while executing interactive applications on an Odroid SoC board. While
their design is QoS-aware, it is not necessarily energy optimal, as I demonstrate with
the DL governor in this work. Bui et al. [20] and Dong et al. [24] used web page
primitives to design energy efficient web browsers. Zhu et al. [42] developed models
for web page load time and energy consumption to design a deadline- and energy-
aware governor for the big.LITTLE heterogeneous architecture. The same authors,
in a recent work [45], extended the work to use the models to design an event-based
scheduler. Another recent work by Gaudette et al. [27] developed probabilistic models
to account for non-determinism in web page load time and demonstrated improved
QoS for the web browser.
Although many of the above works optimize for QoS and energy efficiency for mobile
web browsing, none of them explicitly considers the effect of memory interference.
As shown in this work, memory interference plays a key role, resulting in up to 33%
degradation in web page load time, and significantly impacts both web page load time and energy
efficiency on modern smartphones.
4.7 Chapter Summary
Web browsing is one of the most commonly-used smartphone applications, and its
performance has a direct impact on user satisfaction and, consequently, on the
revenue of websites. Moreover, on current heterogeneous smartphones supporting
multiprogrammed workloads, interference results in significant degradation in web page
load time. I implement a model-based approach for predicting the performance and
power consumption of web browsing in the presence of background processes and
co-scheduled applications. DORA periodically selects an optimal energy-efficient frequency
setting by using the statically-trained performance and power models with
dynamically-varying architecture and system conditions, such as the memory access
intensity of background processes and/or co-scheduled applications, and the temperature
of the cores. The results show high prediction accuracies for the timing and power models
of 97.5% and 96%, respectively. By operating at the most energy-efficient frequency
setting in the presence of interference, DORA improves smartphone energy efficiency
by as much as 35%, and by an average of 18%, compared to the existing interactive
governor, while keeping web page load times under 3
seconds.
Chapter 5
CONCLUSION
This thesis presents a detailed performance characterization for the interference
between co-scheduled smartphone applications in the memory subsystem on a real
mobile platform. Given the abundance of heterogeneous computing units, the shared
memory resource becomes highly-contended, leading to QoS violations for the feature-
rich yet timing-critical smartphone applications. The results presented here indicate
a significant 34% and 32% performance degradation for web browsing and media
player, respectively, when they are co-scheduled with other applications.
Furthermore, this thesis presents a frequency controller that ensures web browser
QoS and maximizes energy efficiency for mobile systems in the multiprogrammed
execution context. This work first explores and identifies models that are sufficiently
accurate in web page load time and smartphone device power predictions. Using the
models, a frequency throttling-based technique is presented that chooses an optimal
frequency setting allowing web pages to load within the pre-specified deadline while
maximizing the energy efficiency during the web page load process.
The performance and power models are offline-trained with web page character-
istics and make online predictions by taking into account run-time system behavior,
such as memory access intensities, core utilization, and temperature. The governor
is implemented and evaluated on a Google Nexus 5 smartphone. The real system
measurements demonstrate that the governor provides an average of 18% and a max-
imum of 35% energy efficiency improvement over the baseline interactive governor
while meeting the 3-second load time QoS deadline whenever possible.
5.1 Future Research Directions
Given the rise of heterogeneous SoC architectures and their increasing importance,
it is important to investigate the resource sharing implications of the numerous
accelerators present in today’s smartphones. The characterization results presented in
this work show the performance impact of memory interference on current-generation
smartphones for current and future workloads. This motivates
the need for effective memory interference mitigation techniques. The
characterization results highlight that sustained computation-bound workloads as well
as user-interactive workloads suffer a significant amount of performance degradation
because of memory interference. While some of the previously proposed design
concepts can be used to manage the shared resources in the smartphone domain, none
of the prior works has investigated and designed QoS-aware shared resource management
solutions specifically for smartphones. The models developed in this work can be
extended to accommodate different SoC architectures; they simply need to be
re-parameterized and re-evaluated. For big.LITTLE architectures, DORA can be used
at the per-core level or at the cluster level, and needs to be augmented with different
load time and power models for the big and the little cores. Moreover, this work
identifies the browser parameters and architectural parameters that most affect
web browsing performance and total device energy consumption.
DORA takes into account CPU core temperature while choosing the optimal
frequency setting. I use an IR camera to measure and map the rise in temperature
when a web page is co-scheduled with an interfering application while the DORA or
interactive governor is running. Figure 5.1 shows the final temperature of the chip when
53
Alibaba is scheduled with a medium intensity interfering application. Figure 5.1(a)
shows the temperature when DORA is scheduled while Figure 5.1(b) shows the tem-
perature when interactive governor is running. It can be seen that temperature rise
incase of interactive governor is more as compared to when DORA is running. The
chip cross-section experiencing highest temperature rise consists of memory, DRAM
and CPU stacked over each other on a multi-layer PCB. Appendix C shows the lay-
out of the device cross-section with various components present. The temperature
of the hotspot in Figure 5.1(b) is 41 Celsius which is above the first threshold tem-
perature for Nexus 5, which causes it to trigger CPU frequency throttling. I observe
from the logs and monitoring device stats that this causes device to throttle to lower
frequency incase of interactive governor specifically from 2.2GHz to 1.9GHz. Since
DORA schedules the CPU core to run at 1.7GHz or 1.4GHz (depending on the phase),
it prevents the device to run into any thermal emergencies, thus preventing the device
to experience temperature induced frequency throttling.
Figure 5.1: Nexus 5 IR image showing the temperature of the chip. (a) and (b) show
the final temperature of the chip when Alibaba is scheduled with a medium-intensity
interfering application with DORA and the interactive governor running, respectively.
54
REFERENCES
[1] Alexa Top 500 Global Websites. http://www.alexa.com/topsites.
[2] National instruments - data acquisition (DAQ). http://www.ni.com/
data-acquisition/.
[3] Web Page Size. http://www.webperformancetoday.com/2015/01/14/
mobile-page-growth.
[4] Snapdragon Performance Visualizer. URL https://developer.
qualcomm.com/mobile-development/increase-app-performance/
snapdragon-performance-visualizer.
[5] Qualcomm Trepn Profiler. URL https://developer.qualcomm.com/
mobile-development/increase-app-performance/trepn-profiler.
[6] VLC media player. URL http://www.videolan.org/vlc/index.html.
[7] Android debug bridge. http://developer.android.com/tools/help/adb.
html.
[8] Android KitKat 4.4. https://www.android.com/versions/kit-kat-4-4/, .
[9] Android NDK. http://developer.android.com/tools/sdk/ndk/index.html,
.
[10] Android NDK toolset. URL https://developer.android.com/tools/sdk/
ndk/index.html.
[11] Smartphone market share. URL http://mobiforge.com/research-analysis/
global-mobile-statistics-2014-part-a-mobile-subscribers-handset/
market-share-mobile-operators.
[12] MATLAB. http://www.mathworks.com/products/matlab/.
[13] Google Nexus 5. http://www.gsmarena.com/lg_nexus_5-5705.php.
[14] Linux profiling with performance counters. https://perf.wiki.kernel.org/
index.php/Main_Page.
[15] Snapdragon 800. https://www.qualcomm.com/products/snapdragon/
processors/800.
[16] Standard performance evaluation corporation benchmark suite, 2006. URL
https://www.spec.org/benchmarks.html.
[17] Carmen Badea, Mohammad R. Haghighat, Alexandru Nicolau, and Alexander V.
Veidenbaum. Towards parallelizing the layout engine of firefox. In USENIX
Conference on Hot Topics in Parallelism, 2010.
55
[18] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PAR-
SEC benchmark suite: Characterization and architectural implications. In Pro-
ceedings of the International Conference on Parallel Architectures and Compila-
tion Techniques, 2008.
[19] W. L. Bircher, Jason Law, Madhavi Valluri, and Lizy K. John. In Technical
Report TR-041104-01, Electrical and Computer Engineering Dept., University
of Texas, Austin TX, 2004.
[20] Duc Hoang Bui, Yunxin Liu, Hyosu Kim, Insik Shin, and Feng Zhao. Rethink-
ing energy-performance trade-off in mobile web page loading. In International
Conference on Mobile Computing and Networking, 2015.
[21] Michael Butkiewicz, Harsha V. Madhyastha, and Vyas Sekar. Understanding
website complexity: Measurements, metrics, and implications. In ACM SIG-
COMM Conference on Internet Measurement Conference, 2011.
[22] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer,
Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous
computing. In Proceedings of the IEEE International Symposium on Workload
Characterization, 2009.
[23] Kwang-Ting Cheng and Yi-Chu Wang. Using mobile GPU for general-purpose
computing: a case study of face recognition on smartphones. In Proceedings of
the International Symposium on VLSI Design, Automation and Test, 2011.
[24] Mian Dong and Lin Zhong. Chameleon: A color-adaptive web browser for mobile
oled displays. IEEE Trans. Mob. Comput., 11:724–738, 2012.
[25] Songchun Fan and Benjamin C. Lee. Evaluating asymmetric multiprocessing for
mobile applications. In International Symposium on Performance Analysis of
Systems and Software, 2016.
[26] Cao Gao, A. Gutierrez, Madhav Rajan, Ron Dreslinski, Trevor Mudge, and
Carole-Jean Wu. A study of mobile device utilization. In Proceedings of the In-
ternational Symposium on Performance Analysis of Systems and Software, 2015.
[27] Benjamin Gaudette, Carole-Jean Wu, and Sarma Vrudhula. Improving smart-
phone user experience by balancing performance and energy with probabilistic
qos guarantee. In International Symposium on High Performance Computer Ar-
chitecture, 2016.
[28] A. Gutierrez, R.G. Dreslinski, T.F. Wenisch, T. Mudge, A. Saidi, C. Emmons,
and N. Paver. Full-system analysis and characterization of interactive smart-
phone applications. In Proceedings of IEEE International Symposium on Work-
load Characterization, 2011.
[29] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon
Steely, Jr., and Joel Emer. Adaptive insertion policies for managing shared
caches. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques, 2008.
56
[30] Jaekyu Lee and Hyesoon Kim. TAP: A TLP-aware cache management policy
for a CPU-GPU heterogeneous architecture. In Proceedings of the International
Symposium on High-Performance Computer Architecture, 2012.
[31] Weiping Liao, Lei He, and Kevin M Lepak. Temperature and supply voltage
aware performance and power modeling at microarchitecture level. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems, 24(7):
1042–1053, 2005.
[32] D. Lin, N. Medforth, K. S. Herdy, A. Shriraman, and R. Cameron. Parabix:
Boosting the efficiency of text processing on commodity processors. In Interna-
tional Symposium on High Performance Computer Architecture (HPCA).
[33] Daniel Lo, Taejoon Song, and G. Edward Suh. Prediction-guided performance-
energy trade-off for interactive applications. In International Symposium on
Microarchitecture, 2015.
[34] Vineeth Mekkat, Anup Holey, Pen-Chung Yew, and Antonia Zhai. Managing
shared last-level cache in a heterogeneous multicore processor. In Proceedings
of the International Conference on Parallel Architectures and Compilation Tech-
niques, 2013.
[35] Leo A. Meyerovich and Rastislav Bodik. Fast and parallel webpage layout. In
International Conference on World Wide Web, 2010.
[36] Dhinakaran Pandiyan, Shin-Ying Lee, and Carole-Jean Wu. Performance, energy
characterization and architectural implications of an emerging mobile platform
benchmark suite – MobileBench. In Proceedings of IEEE International Sympo-
sium on Workload Characterization, 2013.
[37] Moinuddin K. Qureshi and Yale N. Patt. Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared caches.
In Proceedings of the International Symposium on Microarchitecture, 2006.
[38] D. Shingari, A. Arunkumar, and C. J. Wu. Characterization and throttling-based
mitigation of memory interference for heterogeneous smartphones. In Interna-
tional Symposium on Workload Characterization, 2015.
[39] Narendran Thiagarajan, Gaurav Aggarwal, Angela Nicoara, Dan Boneh, and
Jatinder Pal Singh. Who killed my battery?: Analyzing mobile browser energy
consumption. In International Conference on World Wide Web, 2012.
[40] Yuejian Xie and Gabriel H. Loh. PIPP: Promotion/insertion pseudo-partitioning
of multi-core shared caches. In Proceedings of the International Symposium on
Computer Architecture, 2009.
[41] Kaimin Zhang, Lu Wang, Aimin Pan, and Bin Benjamin Zhu. Smart caching for
web browsers. In International Conference on World Wide Web, 2010.
57
[42] Yuhao Zhu and Vijay Janapa Reddi. High-performance and energy-efficient mo-
bile web browsing on big/LITTLE systems. In International Symposium on High
Performance Computer Architecture, 2013.
[43] Yuhao Zhu and Vijay Janapa Reddi. WebCore: Architectural support for Mo-
bileweb browsing. In Proceeding of the International Symposium on Computer
Architecuture, 2014.
[44] Yuhao Zhu, M. Halpern, and V.J. Reddi. Event-based scheduling for energy-
efficient QoS (eQoS) in mobile web applications. In Proceedings of the Interna-
tional Symposium on High Performance Computer Architecture, 2015.
[45] Yuhao Zhu, M. Halpern, and V.J. Reddi. Event-based scheduling for energy-
efficient QoS (eQoS) in mobile web applications. In International Symposium on
High Performance Computer Architecture, 2015.
58
APPENDIX A
PERFORMANCE AND POWER MODELS
The Nexus 5 has 14 different CPU frequency settings available, ranging from 300MHz
to 2265MHz, and 6 memory bus frequencies, ranging from 150MHz to 800MHz. Table
A.1 shows the CPU frequencies corresponding to each memory bus frequency.
Table A.1: Frequency Domains
Memory Bus Frequency (MHz)    CPU Core Frequency (MHz)
150                           300, 424
200                           652
307                           729, 883
460                           960, 1036, 1190
614                           1267, 1497
800                           1574, 1728, 1958, 2265
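The pairing in Table A.1 can be expressed as a small lookup, which is convenient when a load time model has to be chosen from a CPU frequency. The following is a minimal sketch; the names `BUS_TO_CPU_MHZ`, `CPU_TO_BUS_MHZ`, and `bus_frequency_for` are hypothetical, and only the numeric values come from the table.

```python
# Table A.1 as a lookup: memory bus frequency (MHz) -> CPU core frequencies (MHz).
BUS_TO_CPU_MHZ = {
    150: (300, 424),
    200: (652,),
    307: (729, 883),
    460: (960, 1036, 1190),
    614: (1267, 1497),
    800: (1574, 1728, 1958, 2265),
}

# Inverted view: CPU core frequency (MHz) -> memory bus frequency (MHz).
CPU_TO_BUS_MHZ = {
    cpu: bus for bus, cpus in BUS_TO_CPU_MHZ.items() for cpu in cpus
}


def bus_frequency_for(cpu_mhz: int) -> int:
    """Return the memory bus frequency paired with a CPU frequency in Table A.1."""
    if cpu_mhz not in CPU_TO_BUS_MHZ:
        raise ValueError(f"{cpu_mhz} MHz is not a CPU setting listed in Table A.1")
    return CPU_TO_BUS_MHZ[cpu_mhz]


# Sanity check against the counts stated in the text: 14 CPU and 6 bus settings.
print(len(CPU_TO_BUS_MHZ), len(BUS_TO_CPU_MHZ))
print(bus_frequency_for(1497))
```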
I construct four different webpage load time models for four different memory bus
frequencies (307MHz, 460MHz, 614MHz, and 800MHz). Equation A.1 corresponds to bus
frequency 307MHz, Equation A.2 to 460MHz, Equation A.3 to 614MHz, and Equation A.4
to 800MHz. I also construct a single dynamic power model, shown in Equation A.5.
\[
\begin{aligned}
\mathit{LoadTime} ={}& 0.11849\,X_1 - 0.099997\,X_2 - 0.16663\,X_4 - 0.080886\,X_5 \\
& - 210.73\,X_7 - 0.013121\,X_6 - 0.000022067\,X_1 X_2 \\
& - 0.001987\,X_1 X_3 + 0.0017912\,X_1 X_4 - 0.000038667\,X_1 X_5 \\
& + 0.40411\,X_1 X_7 + 0.000075057\,X_1 X_6 + 0.0015224\,X_2 X_3 \\
& - 0.00131\,X_2 X_4 + 0.000089612\,X_2 X_5 + 0.30367\,X_2 X_7 \\
& + 0.000013745\,X_2 X_6 + 0.00025915\,X_3 X_4 + 0.0033505\,X_3 X_5 \\
& + 64.969\,X_3 X_7 + 0.00055156\,X_3 X_6 - 0.0031645\,X_4 X_5 \\
& - 62.253\,X_4 X_7 - 0.00058218\,X_4 X_6 + 1.4417\,X_5 X_7 \\
& - 0.00010194\,X_5 X_6 - 12.463\,X_7 X_6
\end{aligned}
\tag{A.1}
\]
\[
\begin{aligned}
\mathit{LoadTime} ={}& 0.093822\,X_1 - 0.081013\,X_2 - 0.12891\,X_4 - 0.065203\,X_5 \\
& - 177.55\,X_7 + 0.014776\,X_6 - 0.000019697\,X_1 X_2 \\
& - 0.0016516\,X_1 X_3 + 0.0014938\,X_1 X_4 - 0.00002876\,X_1 X_5 \\
& + 2.4408\,X_1 X_7 + 0.000054907\,X_1 X_6 + 0.0014956\,X_2 X_3 \\
& - 0.001316900\,X_2 X_4 + 0.000072384\,X_2 X_5 - 0.87069\,X_2 X_7 \\
& - 0.000015393\,X_2 X_6 + 0.000201940\,X_3 X_4 + 0.002434\,X_3 X_5 \\
& + 10.141\,X_3 X_7 + 0.00063519\,X_3 X_6 - 0.0022852\,X_4 X_5 \\
& - 11.006\,X_4 X_7 - 0.00068458\,X_4 X_6 + 0.63561\,X_5 X_7 \\
& - 0.000031694\,X_5 X_6 - 29.506\,X_7 X_6
\end{aligned}
\tag{A.2}
\]
\[
\begin{aligned}
\mathit{LoadTime} ={}& 0.070028\,X_1 - 0.060347\,X_2 - 0.094594\,X_4 - 0.046397\,X_5 \\
& - 0.00028728\,X_6 - 0.000014691\,X_1 X_2 - 0.0012092\,X_1 X_3 \\
& + 0.0010942\,X_1 X_4 - 0.000021279\,X_1 X_5 + 0.000034918\,X_1 X_6 \\
& + 0.0011562\,X_2 X_3 - 0.0010246\,X_2 X_4 + 0.000053793\,X_2 X_5 \\
& - 0.000019207\,X_2 X_6 + 0.00014613\,X_3 X_4 + 0.0016636\,X_3 X_5 \\
& + 0.00042852\,X_3 X_6 - 0.001554700\,X_4 X_5 - 0.00043368\,X_4 X_6 \\
& - 0.000020167\,X_5 X_6
\end{aligned}
\tag{A.3}
\]
\[
\begin{aligned}
\mathit{LoadTime} ={}& 0.058332\,X_1 - 0.051529\,X_2 - 0.07978\,X_4 - 0.039141\,X_5 \\
& + 454.77\,X_7 + 0.023517000\,X_6 - 0.000012217\,X_1 X_2 \\
& - 0.0010223\,X_1 X_3 + 0.000926\,X_1 X_4 - 0.00001728\,X_1 X_5 \\
& - 0.13363\,X_1 X_7 + 0.000009375\,X_1 X_6 + 0.001016\,X_2 X_3 \\
& - 0.00090457\,X_2 X_4 + 0.000044356\,X_2 X_5 + 0.48615\,X_2 X_7 \\
& - 0.000000857\,X_2 X_6 + 0.00012098\,X_3 X_4 + 0.0013311\,X_3 X_5 \\
& + 10.649\,X_3 X_7 - 0.00049672\,X_3 X_6 - 0.0012402\,X_4 X_5 \\
& - 7.7044\,X_4 X_7 + 0.00049692\,X_4 X_6 + 1.5724\,X_5 X_7 \\
& + 0.000015285\,X_5 X_6 - 32.773\,X_7 X_6
\end{aligned}
\tag{A.4}
\]
\[
\begin{aligned}
\mathit{DynamicPower} ={}& -0.24461 - 0.000029942\,X_1 - 0.000050714\,X_8 \\
& + 0.0016679\,X_7 + 0.033868\,X_6 + 0.13984\,X_9
\end{aligned}
\tag{A.5}
\]
where the independent variables are defined in Table A.2.
Table A.2: List of Independent Variables
Variable  Description
X1        Number of DOM tree nodes
X2        Number of class attributes
X3        Number of href attributes
X4        Number of "a" tags
X5        Number of "div" tags
X6        LLC MPKI (shared L2 cache misses per kilo-instruction)
X7        Core frequency
X8        Number of attributes (only used for the power model)
X9        CPU utilization of co-scheduled task (only used for the power model)
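To make the structure of these models concrete, the sketch below evaluates the load time model for the 614MHz bus frequency (A.3) and the dynamic power model (A.5) as a sum of linear terms and pairwise interaction terms. The coefficients are copied from the equations; the dictionary and function names are hypothetical, the sample inputs are made-up values rather than measurements, and the appendix does not state units for the inputs or outputs, so none are assumed here.

```python
# Equation A.3: linear coefficients keyed by variable index (X3 and X7 have none).
A3_LINEAR = {1: 0.070028, 2: -0.060347, 4: -0.094594, 5: -0.046397, 6: -0.00028728}

# Equation A.3: pairwise interaction coefficients keyed by (i, j).
A3_PAIRS = {
    (1, 2): -0.000014691, (1, 3): -0.0012092, (1, 4): 0.0010942,
    (1, 5): -0.000021279, (1, 6): 0.000034918,
    (2, 3): 0.0011562, (2, 4): -0.0010246, (2, 5): 0.000053793,
    (2, 6): -0.000019207,
    (3, 4): 0.00014613, (3, 5): 0.0016636, (3, 6): 0.00042852,
    (4, 5): -0.001554700, (4, 6): -0.00043368,
    (5, 6): -0.000020167,
}

# Equation A.5: intercept and linear coefficients.
A5_INTERCEPT = -0.24461
A5_LINEAR = {1: -0.000029942, 8: -0.000050714, 7: 0.0016679, 6: 0.033868, 9: 0.13984}


def load_time_a3(x: dict) -> float:
    """Evaluate Equation A.3; x maps a variable index (1 to 6) to its value."""
    linear = sum(coef * x[i] for i, coef in A3_LINEAR.items())
    pairs = sum(coef * x[i] * x[j] for (i, j), coef in A3_PAIRS.items())
    return linear + pairs


def dynamic_power_a5(x: dict) -> float:
    """Evaluate Equation A.5; x maps variable indices 1, 6, 7, 8, 9 to values."""
    return A5_INTERCEPT + sum(coef * x[i] for i, coef in A5_LINEAR.items())


# Hypothetical page and system features, for illustration only.
sample = {1: 1200, 2: 800, 3: 300, 4: 320, 5: 400, 6: 5.0, 7: 1497, 8: 2000, 9: 0.5}
print(load_time_a3(sample))
print(dynamic_power_a5(sample))
```

Keeping the coefficients in dictionaries keyed by variable index makes it straightforward to check each entry against the printed equation and to add the X7 terms when encoding A.1, A.2, or A.4 in the same way.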
APPENDIX B
WORKLOADS
This section lists the workloads used for DORA evaluation.
Table B.1: Workloads used for DORA Evaluation
Description
WL1 Imgur + BFS
WL2 Imdb + Kmeans
WL3 Imgur + Kmeans
WL4 Aliexpress + Srad1
WL5 Aliexpress + Backprop
WL6 Hao123 + Hotspot
WL7 Imgur + Needle
WL8 Aliexpress + Srad2
WL9 Hao123 + Needle
WL10 Hao123 + Btree
WL11 Imdb + Backprop
WL12 Imdb + Srad2
WL13 Reddit + Lavamd
WL14 BBC + Hotspot
WL15 ESPN + Needle
WL16 ESPN + Srad2
WL17 Amazon + Btree
WL18 Twitter + Needle
WL19 CNN + BFS
WL20 CNN + Kmeans
WL21 360 + Backprop
WL22 ESPN + Hotspot
WL23 360 + Heartwall
WL24 360 + Srad2
WL25 BBC + BFS
WL26 CNN + Needle
WL27 Youtube + Backprop
WL28 Twitter + Lavamd
WL29 Reddit + Needle
WL30 Amazon + Needle
WL31 Youtube + Btree
WL32 Alibaba + Backprop
WL33 Amazon + Heartwall
WL34 Youtube + Srad1
WL35 Alibaba + Hotspot
WL36 Twitter + BFS
WL37 Msn + Srad2
WL38 Reddit + Btree
WL39 Msn + Srad1
WL40 BBC + Backprop
WL41 Alibaba + BFS
WL42 Msn + Needle
Table B.2: Workloads used for DORA Evaluation
Description
WL1 Ebay + Backprop
WL2 Ebay + Kmeans
WL3 Ebay + Srad2
WL4 Alipay + Srad1
WL5 Alipay + Needle
WL6 Firefox + Hotspot
WL7 Alipay + Srad2
WL8 Firefox + Srad2
WL9 Firefox + Needle
WL10 Instagram + Btree
WL11 Instagram + Backprop
WL12 Instagram + Srad2
APPENDIX C
DEVICE LAYOUT
This section shows the layout of the device, highlighting the placement of the
major components present in the chipset. The Nexus 5 houses the Qualcomm APQ8074 chip.
Figure C.1: The Nexus 5 has a multi-layered PCB with different components stacked
over each other.
Figure C.2: The Nexus 5 has a multi-layered PCB with different components stacked
over each other.
Figure C.3: The Nexus 5 has a multi-layered PCB with different components stacked
over each other.
