Software systems for power and energy conservation by Kharbanda, Harshit
© 2013 Harshit Kharbanda
SOFTWARE SYSTEMS FOR POWER AND ENERGY CONSERVATION
BY
HARSHIT KHARBANDA
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2013
Urbana, Illinois
Adviser:
Professor Roy H. Campbell
Abstract
The continued scaling of transistors in accordance with Moore’s law, together with the failure of
Dennard scaling, has made power a critical factor in all facets of microprocessor design. To
counter the Power Wall, techniques such as voltage and frequency scaling, heterogeneous hardware,
and voltage underscaling coupled with improved application reliability have been proposed. Most
of the innovation in power conservation has happened at the hardware level. Software needs to be
equally responsible for power conservation. For instance, mobile cloud computing is used to offload
compute intensive tasks that drain a mobile device’s battery. Mobile ad-hoc computing can be
used as an alternative to mobile cloud computing in cases where cloud access is not available or
hurts application performance. This thesis presents two systems, Synergy and Equilibria,
which aid the hardware in its pursuit of conserving power.
Synergy is a middleware that increases the battery life for a system of mobile devices connected
in a peer-to-peer ad-hoc network. Synergy conserves energy by scaling core frequencies and by
intelligently distributing the computation among peer devices. The middleware is not restricted to
mobile phones and in no way restricts the mobility of the devices.
Equilibria provides software support for voltage domains connected in series. Hardware innovations
for power conservation have ignored the power losses incurred by the power delivery circuits
in the face of low voltage and high current demands. Connecting the voltage domains in a series
circuit (instead of parallel) can result in highly efficient power delivery. For series connected
voltage domains to work, the voltage draw by each load should be the same. This voltage drop
(across the load) is dependent on its CPU utilization. Equilibria is a load balancer which actively
monitors the CPU utilization of the processors in the system and ensures equal voltage draw. It runs
on a master processor and alters the frequency of the client processors based on the average CPU
utilization of all the processors in the system. Equilibria is generic, i.e. the system works
irrespective of the software running on the processors.
To my parents, Dr. Suman Kharbanda and Dr. S.V. Kharbanda, for having immense faith in my
abilities.
Acknowledgements
I would like to thank my advisor Prof. Roy H. Campbell for his support and the freedom he gave me
throughout the duration of this work. His enthusiasm and energy were a huge source of motivation.
I would like to thank Professor Sam King and Professor Robert Pilawa for their immensely
helpful feedback. Their insightful thoughts and ideas had a huge impact on the project. Thanks
to Manoj Krishnan and Josiah McClurg, for their hard work and valuable suggestions. I am also
thankful to the researchers at the University of Michigan and Qualcomm Inc. for writing PowerTutor
and AllJoyn respectively and for providing them as open source tools.
Thank you Reza Farivar for your support during the long caffeine-filled debugging sessions.
Finally, I would like to thank Riccardo Crepaldi and members of the SRG - Abhishek Verma,
Christina Abad and Mirko Montanari - friends and amazing mentors. They showed tremendous
patience with my incessant questions and have helped me learn the ways of research.
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Mobile Devices
1.2 Datacenter Computing Devices
Chapter 2 Background
2.1 Basics Of Power And Energy Conservation
2.2 Power Savings Beyond Voltage Scaling
Chapter 3 Energy Conservation On Mobile Devices
3.1 Analysis Of Energy Consumption In A Mobile P2P Network
3.2 Design Of Synergy
3.2.1 Device Discovery And Communication
3.2.2 Device State
3.2.3 Computation Distribution
3.2.4 Frequency And Voltage Scaling
3.2.5 Algorithm Design
Chapter 4 Energy Conservation In Datacenters
4.1 Why Series Connected Voltage Domains
4.2 Base Power Measurements
4.2.1 Processor
4.2.2 Power Measurement Methods
4.3 Software Design For Series Connected Voltage Domains
4.3.1 Master Node
4.3.2 Client Nodes
Chapter 5 Results
5.1 Power Savings In Datacenter Computers
5.2 Energy Savings On Mobile Devices
5.2.1 Image Processing
5.2.2 Video Processing
5.2.3 Latency-Energy Tradeoff
5.3 Synergy’s Limitations
5.3.1 Parallelism
5.3.2 Communication
5.3.3 Data Privacy
5.3.4 Non-CPU Intensive Applications
Chapter 6 Conclusion
References
List of Figures
1.1 Voltage supply scaling trends of logic technologies with years (i) HP-High Performance (ii) LOP-Low Operating Power (iii) LSTP-Low Stand By Power and (iv) III-V/Ge-High performance
3.1 Energy Consumption vs. Number of devices
3.2 Communication Latency vs. Number of devices
3.3 System table maintained by each device
3.4 Decision table maintained by Server
4.1 Comparison of a 64 core, 30W server with cores connected in (a) Parallel and (b) Series
4.2 Portion of Raspberry Pi schematic
4.3 Power draw with frequency scaling for the Raspberry Pi
4.4 Frequency and power ratio for the Raspberry Pi
4.5 Flow chart for the Master
4.6 Flow chart for the Clients
5.1 Power consumption via Series connected voltage domains under low load
5.2 Power consumption via Series connected voltage domains under moderate load
5.3 Power consumption via Series connected voltage domains under high load
5.4 Graphs for Energy Consumption vs. Number of devices and Latency vs. Number of devices, for dataset sizes 2MB, 4MB, 6MB, 25MB and 50MB
5.5 Graphs for Energy Split Up, for dataset sizes 2MB, 4MB, 6MB, 25MB and 50MB
5.6 Latency Penalty vs. Energy Savings
Chapter 1
Introduction
The continued increase in the number of transistors packed into the same chip area results in
higher heat dissipation, so that Moore’s law is now threatened by the Power Wall. The Power
Wall affects all computing devices, mobile and datacenter alike, but the two classes of devices
present different perspectives on the problem. For datacenter computing devices, several hardware
innovations focused on better than worst case designs [1] have been proposed. However, the
constrained nature of mobile devices limits such hardware changes. Therefore, mobile devices rely
on software support for managing power and battery life. Moreover, many of the hardware changes
for datacenter computers also require software awareness. Hence, there is a need for software
systems which focus on power and energy conservation.
1.1 Mobile Devices
Applications on mobile devices have been traditionally constrained by limited resources such as a
small memory and a battery-powered computing environment [2]. The battery-powered environment
restricts resource demanding applications such as document translation, information retrieval and
real-time gaming. Moreover, compute intensive applications such as complex media processing, large
scale data mining and machine learning have not been deployed onto mobile devices because of the
impact they would have on the battery of such a device.
One solution to this problem is to offload the computation onto a remote cloud and leverage
services such as Google Goggles [3] and Google Voice [4]. This form of computing, also known
as mobile-cloud computing, allows the deployment of several compute intensive applications on
naturally resource restricted mobile devices. Unfortunately, high WAN latencies are a fundamental
obstacle to this approach. WAN delays in the critical path of user interaction can hurt usability by
degrading the crispness of system response [5]. Moreover, WAN latencies are unlikely to improve in
the near future due to heavy emphasis on security and manageability [6].
Wireless bandwidth plays a major role for applications such as image processing and data mining,
which process relatively large datasets. For example, a complex image recognition application
requires a high resolution image, which translates to a larger image size. Wireless LAN bandwidths
are roughly two orders of magnitude larger than the wireless internet (WAN) bandwidths used for
cloud access on mobile devices. This translates to a transfer time of 80 milliseconds instead of 16
seconds for a 4 MB file. Hence, for such applications, mobile-cloud computing might not be the best
alternative.
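The transfer times quoted above follow from simple arithmetic. The sketch below reproduces them under stated assumptions: the link rates of 400 Mbit/s (local wireless link) and 2 Mbit/s (wireless internet link) are illustrative values chosen to be consistent with the quoted times, not measurements from this work, and 1 MB is taken as 10^6 bytes.

```python
def transfer_time_s(size_mb: float, rate_mbit_s: float) -> float:
    """Time to move size_mb megabytes (10**6 bytes each) over a link of rate_mbit_s."""
    return size_mb * 8 / rate_mbit_s


# Assumed, illustrative link rates (not taken from the thesis).
LOCAL_RATE_MBIT_S = 400.0
INTERNET_RATE_MBIT_S = 2.0

local_time = transfer_time_s(4, LOCAL_RATE_MBIT_S)
internet_time = transfer_time_s(4, INTERNET_RATE_MBIT_S)
print(f"local link: {local_time * 1000:.0f} ms, internet link: {internet_time:.0f} s")
```

Under these assumed rates the ratio of the two times is 200, i.e. roughly two orders of magnitude.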
In several situations, cloud access may also have a negative impact on application performance
or may not be available altogether. This is true in the case of military operations, where compute
intensive applications like image recognition and voice recognition are being encouraged by the
government [7] [8]. The same holds when the infrastructure is unavailable due to a
natural disaster or when cloud access is extremely slow due to poor infrastructure.
An alternative to mobile cloud computing is mobile ad-hoc computing. The collective power of
trusted mobile devices can be leveraged when cloud access is not available or hurts application
performance. Battery life of these devices still remains a concern in the case of mobile ad-hoc
computing. This thesis proposes Synergy, a middleware that reduces the collective energy consumption
of networked devices connected in an ad-hoc peer-to-peer fashion. Synergy achieves this by
intelligently distributing the computation amongst the networked devices and by manipulating their
core voltages and frequencies. The system ensures that the application response latency is less than
a pre-determined, fixed threshold. Synergy also allows mobile devices to enter and exit the network
and dynamically adjusts to these changes. Hence, it does not limit the mobility of these devices.
Such a system might be extremely useful in an environment where battery life is critical and several
mobile devices move together in a group.
1.2 Datacenter Computing Devices
There has been a surge of low power microprocessor designs for datacenter computers. Most of
these designs advocate voltage underscaling [1], which results in timing errors occurring in hardware
modules. These techniques, though successful, might not be applicable to all applications. Moreover,
the majority of the power, which is lost in the power delivery circuits due to the severe voltage
step-down requirements, is completely ignored by these designs. It has been shown that less than 25%
of the power entering the data centers performs useful work [9]. These major power losses are caused
by the multiple DC conversion stages after the AC rectification, which step down the bus voltage to a
level acceptable to the motherboard.
Figure 1.1: Voltage supply scaling trends of logic technologies with years (i) HP-High Performance
(ii) LOP-Low Operating Power (iii) LSTP-Low Stand By Power and (iv) III-V/Ge-High performance
The supply voltage in modern computing systems continues to scale down. According to the
International Technology Roadmap for Semiconductors (ITRS), supply voltage will fall below 600 mV
by 2024 [10], as shown in figure 1.1. In general, the power consumed by processors and other digital
circuits is proportional to the square of their input voltage [11], which gives an incentive to
decrease operating voltages [12]. However, low voltage power is challenging and costly to supply.
For a given power input, power converter losses are inversely proportional to the square of the power
supply output voltage, which may offset the advantages of lower power consumption by the processor.
Moreover, in traditional computing systems, digital loads in parallel share a common supply
voltage and hence suffer from large current slews. Connecting voltage domains in series allows the
device to be operated at a higher supply voltage [13,14] and hence obviates this problem.
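One simple way to see the inverse-square scaling of delivery losses is to assume that the losses are dominated by resistive (I²R) conduction in a fixed delivery resistance. The sketch below works through that case; the 30 W load and the 1 milliohm resistance are hypothetical values used only for illustration, and real converters have other loss mechanisms that this model ignores.

```python
def conduction_loss_w(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive loss I^2 * R when delivering power_w at voltage_v."""
    current_a = power_w / voltage_v  # I = P / V
    return current_a ** 2 * resistance_ohm


# Hypothetical numbers: a 30 W load fed through 1 milliohm of delivery resistance.
POWER_W = 30.0
RESISTANCE_OHM = 0.001

loss_at_1v = conduction_loss_w(POWER_W, 1.0, RESISTANCE_OHM)
loss_at_2v = conduction_loss_w(POWER_W, 2.0, RESISTANCE_OHM)
ratio = loss_at_1v / loss_at_2v
print(f"loss at 1 V: {loss_at_1v:.3f} W, loss at 2 V: {loss_at_2v:.3f} W, ratio: {ratio:.1f}")
```

In this model, doubling the delivery voltage for the same power halves the current and cuts the conduction loss by a factor of about four, which is the motivation for operating a series stack at a higher supply voltage.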
In a series connected configuration of components, it is important that the voltage draw by each
load is the same. This voltage drop (across the load) is dependent on its CPU utilization. Hence,
this thesis presents a generic load balancer, Equilibria, which actively monitors the CPU utilization
of the processors in the system and ensures equal voltage draw by all the processors. Equilibria runs
on a master processor and changes the frequency of the client processors based on the average CPU
utilization of all the processors in the system. The load balancer is generic, i.e. the system would
work irrespective of the software running on the processors.
Chapter 2
Background
This chapter outlines a brief background of basic power consumption on a chip. These concepts
justify the design decisions taken in this work and earlier research initiatives for power conservation.
2.1 Basics Of Power And Energy Conservation
Power consumed by a core can be split into Dynamic power and Static power [15]. The following
equation gives the relationship between dynamic power, frequency and voltage for a chip.
$$ P_{\mathrm{dynamic}} \propto \tfrac{1}{2} \, C \, V^{2} \, f \tag{2.1} $$
where f is the frequency at which the transistors in the chip are switched, V is the voltage supplied
to the chip and C is the capacitive load. The capacitive load cannot be altered by software
techniques; however, frequency and voltage can be altered with operating system support. Roughly, the
core voltage and frequency are directly proportional (up until recent chips):
$$ f \propto V \tag{2.2} $$
Substituting equation 2.2 into equation 2.1, we have:
$$ P_{\mathrm{dynamic}} \propto f^{3} \tag{2.3} $$
Performance of an application is loosely defined in terms of its execution time. For compute bound
applications, execution time is inversely proportional to core frequency, so core frequency and
performance hold a linear relationship. To reduce the temperature
of a particular chip, performance is sometimes traded off by reducing the core frequency, thereby
reducing the power consumption (heat dissipated). Though heat dissipation is a problem in server
and laptop chips [16], mobile phones use ARM processors, which are good at maintaining low on-chip
temperatures. The energy consumption is more critical in these chips as it determines the battery
life of the device. The following holds between energy, power and time:
$$ \mathrm{Energy} = \mathrm{Power} \times \mathrm{Time} \tag{2.4} $$
Therefore, although one is able to reduce power by reducing the frequency, this process decreases
the performance of an application and thereby increases its execution time. If one naïvely reduces
the frequency to the lowest possible value, there is a possibility that time increases to such a large
extent that the total energy consumed by the device increases. Hence, in a multi-core environment,
it is required that the per-core frequency is intelligently set.
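The trade-off described above can be sketched numerically. The model below is an assumption made for illustration, not the thesis's measurement: power is taken as a cubic dynamic term plus a constant static term, execution time is taken as work divided by frequency, and all constants (`k_dyn`, `p_static_w`, `work_gcycles`, the frequency steps) are hypothetical.

```python
def energy_j(freq_ghz, work_gcycles=10.0, k_dyn=1.0, p_static_w=0.5):
    """Energy = Power x Time, with P = k_dyn * f^3 + p_static and T = work / f."""
    power_w = k_dyn * freq_ghz ** 3 + p_static_w
    time_s = work_gcycles / freq_ghz
    return power_w * time_s


# Hypothetical frequency steps in GHz.
steps = [0.2, 0.4, 0.6, 0.8, 1.0]
energies = {f: energy_j(f) for f in steps}
best = min(energies, key=energies.get)

for f in steps:
    print(f"{f:.1f} GHz -> {energies[f]:.2f} J")
print("lowest-energy step:", best)
```

With these made-up constants the lowest frequency step costs more energy than the highest one, because the longer run time multiplies the static power; the minimum lies at an intermediate step. This is the sense in which the frequency has to be set intelligently.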
2.2 Power Savings Beyond Voltage Scaling
The technique of reducing the frequency and voltage to save power (mentioned above) is called
voltage underscaling. Power conservation through voltage underscaling has been successfully tried
and tested over the last decade. Undervolting the device causes timing errors, which are detected
and corrected by hardware techniques such as those described in Razor [17]. These methods might
cause performance overhead and demand significant changes to the hardware. An alternative
is to allow the majority of errors to be masked at the architectural and micro-architectural level
and to make the software more error resilient, such that the errors that creep through these levels do
not significantly affect the application results. Such techniques, termed Algorithmic noise
tolerance [18] and Stochastic computing [19], might require the redesign of existing applications.
Application and hardware re-design are tolerable side effects of any approach that conserves power.
One of the more hindering side effects of these designs is leakage energy. When the supply voltage
V_dd drops below the device threshold voltage (V_t), leakage energy starts to increase due to the
exponential increase in delay and becomes comparable with dynamic energy. This trade-off between
leakage energy and dynamic energy results in a minimum total energy point at an optimal V_dd < V_t.
With the continued decrease in supply voltage, the optimal value of V_dd is nearing V_t, resulting in
diminishing returns from voltage underscaling. Stacking¹ voltage domains in series is a complementary
approach which allows for power savings even for V_dd > V_t. Moreover, a series connection of voltage
domains allows voltage scaling with software-only techniques, thereby avoiding the latency and chip
space consumed by on-chip voltage regulators.
¹ Stacking here does not refer to 3-D stacking of circuit components.
Chapter 3
Energy Conservation On Mobile Devices
Synergy is a middleware that conserves energy on a P2P network of mobile devices. It achieves
this by distributing workload and scaling the core frequency and voltage. Synergy can also allow
applications to achieve lower latency and share network between mobile devices. It is device and
operating system agnostic and adapts to device mobility. This chapter describes the design and the
decision making process of Synergy.
3.1 Analysis Of Energy Consumption In A Mobile P2P Network
This section provides a brief analysis of energy consumption for data transmission in a peer-to-
peer mobile device system. Benchmarks were performed, stressing the Wi-Fi subsystem of the
devices to see how the energy consumption varies with the number of connected devices. The
peer-to-peer network was formed using AllJoyn [20]. The power consumption was measured using
PowerTutor [21]. PowerTutor is an open source project which is provided as an Android application
to measure power consumption in Android phones. The base source code was modified to specifically
monitor the Wi-Fi subsystem in detail.
Synergy does not use all the devices available in the system. The number of devices over which
the computation is distributed is chosen based on a number of factors (see section 3.2). As a part of
the decision making process, Synergy requires a rough estimate of the amount of energy consumed
for data transmission between the devices. Since this energy is highly dependent on the devices,
the operating system and the network implementation (routing), benchmarks were used for this
purpose. The applications considered for these tests are those that require computation to
be performed on data; for truly compute intensive applications operating on very little or no
data, the network is used only for remote procedure calls, which have negligible overhead.
Figure 3.1: Energy Consumption vs. Number of devices
(Plot of energy consumption (J) against number of devices for 4MB, 8MB, 16MB and 32MB of data.)
The tests were performed using Motorola Droid A855 phones running Android 2.2 (codenamed
Froyo). The results for energy consumption of the system versus the number of devices can be seen
in figure 3.1. Each test was performed 10 times and the 95% confidence intervals are shown. The
server¹ sends out the data, which is split equally among the clients. Each
client then sends its received chunk back to the server, as it would after processing. The time
taken for the communication is shown in figure 3.2. It should be noted that for these benchmarks
no actual computation is performed on the data by the clients or the server. Hence the time
shown here is the time taken only for communication.
It can be seen from figure 3.1 that for smaller amounts of data, the energy increases very slowly
with the number of devices. This is a peculiar result: the energy would be expected to increase when
more phones are added to the system even if the data is equally split, because a major cause of energy
consumption by the Wi-Fi subsystem is the Wi-Fi maintenance energy [22]. The result can be
explained by the communication times shown in figure 3.2. As the number of devices
increases, the communication time decreases up to a threshold number of devices (4). This is because
the data is sent in the form of 128KB packets and the communication with the various clients is
multithreaded. The run time system actively context switches between the available threads to
keep the Wi-Fi subsystem busy at all times. This allows the maximum bandwidth of the Wi-Fi to
be attained faster (when there are more devices) and hence decreases the communication time. This
effect wears off beyond 4 devices due to the limit on the Wi-Fi bandwidth of the device. As
the number of devices increases, the time decreases but the Wi-Fi power increases. This is because,
due to the context switching of the threads, the Wi-Fi subsystem now has the additional overhead of
connecting and disconnecting with the devices at a higher rate (equal to the rate of thread
switching, which is determined by the run time system). The decrease in time offsets this increase in
power, thereby keeping the energy almost constant. The effect of multithreading wears off at 4
devices, and hence beyond 4 devices the energy of the system increases at almost the same rate as
the power of the system. For brevity, the power measurements are not plotted, as they can be easily
obtained from the energy and time (refer equation 2.4).

¹ For a particular session, the peer initiating the computation is regarded as the server and all
the other devices as clients.

Figure 3.2: Communication Latency vs. Number of devices
(Plot of communication time (ms) against number of devices for 4MB, 8MB, 16MB and 32MB of data.)
3.2 Design Of Synergy
Synergy assumes root level access to the device in order to carry out frequency scaling.
When a device first joins the system, a nodeID is assigned to it; this information is stored
in the system table shown in figure 3.3. Apart from the system table, Synergy also maintains a
decision table (shown in figure 3.4). Both the system table and the decision table are maintained
only by the device initiating the computation (server). Synergy does not pose any constraint on the
type of device, i.e. the network can be formed between different types of devices running different
operating systems, for example between a mobile device running Android and a notebook running
Linux/Windows.
| nodeID | availability | frequencystep_1 | ... | frequencystep_n |
Figure 3.3: System table maintained by each device

| number of clients | frequency per device | LatencyDiff | EnergyDiff |
Figure 3.4: Decision table maintained by Server

Synergy has the capability to run in two modes: local and distributed. In local mode, the
application runs as it would have run without Synergy, whereas in distributed mode, Synergy
distributes the computation among the clients. Synergy does not specifically guarantee a
pre-determined power budget, although it can be easily modified to do so. The amount of energy
savings is not fixed and depends heavily on the application characteristics. The system can decide
to run the application locally and hence conserve no energy, or run in distributed mode with lower
core frequencies, resulting in energy conservation. Such a decision is highly application dependent,
and hence Synergy allows this decision to be made by the server. The server can decide to give
priority to latency or to energy savings within a pre-defined threshold for latency penalty.
Performance degradation causes jitter, which can be detrimental to a user’s experience on mobile
devices [6]. Hence, by default Synergy considers latency as the primary factor in any decision
making process.
3.2.1 Device Discovery And Communication
Synergy uses an open source library called AllJoyn [20] for the formation of a peer-to-peer network.
AllJoyn creates a per-session server-client hierarchy in the peer-to-peer network. A session is defined
as the execution time of the computation which was initiated by a device running Synergy. The
device that initiates the computation is the server and the rest of the devices in the system are clients
for any given session. When a device needs to run an application, the application run command is
intercepted by Synergy. Synergy then multicasts an advertisement to all the devices connected on
the network. The other devices which are also running Synergy respond to this message with their
available frequency levels.
3.2.2 Device State
For each session, the server maintains a system table which has the information about the nodes in
the system and their available frequency steps. A typical row of this table contains the following
fields: nodeID, availability, frequencystep_1, ..., frequencystep_n. The nodeID represents the unique
ID of the node assigned by Synergy running on the server. The middleware ensures that the nodeID
does not reveal anything about the identity of that node to the server. NodeIDs in the system are
generated by computing a cryptographic hash of the node’s IP address. The frequencystep fields
help Synergy decide how many devices to choose and by what step to scale down the frequency of
the chosen devices. Synergy uses a heartbeat protocol to ensure that a device that was carrying
out part of the computation has not left the system. If the server discovers that a device has left
the system, the availability corresponding to the nodeID is marked FALSE. The portion of computation
lost is redistributed among the available clients. This ensures that mobility of such devices is not
restricted.
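The bookkeeping described in this section can be sketched as a small in-memory table. This is a minimal illustration under stated assumptions, not Synergy's implementation: the class and method names (`NodeEntry`, `SystemTable`, `expire`), the choice of SHA-256 as the cryptographic hash, the 5 second heartbeat timeout, and the example addresses and frequency steps are all hypothetical.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class NodeEntry:
    node_id: str
    frequency_steps: list          # available frequency steps reported by the client
    available: bool = True
    last_heartbeat: float = field(default_factory=time.monotonic)


def make_node_id(ip_address: str) -> str:
    """Derive an opaque nodeID from the IP address (SHA-256 is an assumed choice)."""
    return hashlib.sha256(ip_address.encode("utf-8")).hexdigest()


class SystemTable:
    def __init__(self, timeout_s: float = 5.0):   # hypothetical heartbeat timeout
        self.timeout_s = timeout_s
        self.rows = {}

    def add_node(self, ip_address, frequency_steps, now=None):
        node_id = make_node_id(ip_address)
        stamp = time.monotonic() if now is None else now
        self.rows[node_id] = NodeEntry(node_id, list(frequency_steps), True, stamp)
        return node_id

    def heartbeat(self, node_id, now=None):
        row = self.rows[node_id]
        row.last_heartbeat = time.monotonic() if now is None else now
        row.available = True

    def expire(self, now=None):
        """Mark silent nodes unavailable and return their IDs for redistribution."""
        now = time.monotonic() if now is None else now
        lost = []
        for row in self.rows.values():
            if row.available and now - row.last_heartbeat > self.timeout_s:
                row.available = False
                lost.append(row.node_id)
        return lost


# Two hypothetical clients join at t=0; only the first keeps sending heartbeats.
table = SystemTable(timeout_s=5.0)
a = table.add_node("10.0.0.2", [250, 400, 550], now=0.0)
b = table.add_node("10.0.0.3", [250, 400, 550], now=0.0)
table.heartbeat(a, now=4.0)
lost = table.expire(now=6.0)
print("nodes whose work must be redistributed:", lost)
```

The IDs returned by `expire` are the nodes whose share of the computation would be handed to the remaining available clients.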
3.2.3 Computation Distribution
Synergy does not divide the data evenly among the server and all the chosen clients. This is
because if the data is divided evenly, then the server will end up finishing its computation faster
compared to the clients, as the clients have to pay for the communication latency. Hence to mask
this communication latency, the server keeps a bigger chunk of data with itself and gives out only a
smaller chunk of data. This smaller chunk of data is then evenly distributed among all the chosen
clients.
Amdahl’s law states that the speedup obtained by any parallel application is restricted by its
serial portion. The decision of overlapping communication with computation is a direct application
of this law. The serial portion in case of the middleware is the aggregation of the results from
the various clients. By using unequal distribution of data, the aggregation is overlapped with
computation on the server so that the serial portion of the application is minimized, resulting in
lower application latency when running in distributed mode.
Because of this unequal distribution, the core frequencies are also not scaled down by the same
factor. The server, which keeps more computation/data, scales down its core frequency by a smaller
factor than the clients, which get a smaller amount of data.
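The unequal split described in this section can be sketched as follows. The function name `split_data` and the example server share of 0.6 are hypothetical; how the share is actually chosen is the subject of section 3.2.5.

```python
def split_data(data: bytes, server_fraction: float, num_clients: int):
    """Return (server_chunk, client_chunks) with the client share divided evenly."""
    if not 0.0 < server_fraction <= 1.0:
        raise ValueError("server_fraction must be in (0, 1]")
    if num_clients < 1:
        raise ValueError("need at least one client")
    cut = round(len(data) * server_fraction)
    server_chunk, rest = data[:cut], data[cut:]
    base, extra = divmod(len(rest), num_clients)
    chunks, start = [], 0
    for i in range(num_clients):
        size = base + (1 if i < extra else 0)   # spread the remainder byte by byte
        chunks.append(rest[start:start + size])
        start += size
    return server_chunk, chunks


payload = bytes(1000)
server_chunk, client_chunks = split_data(payload, server_fraction=0.6, num_clients=3)
print(len(server_chunk), [len(c) for c in client_chunks])
```

Here the server keeps the larger chunk and the three clients receive near-equal shares of the remainder, so the clients' communication time can overlap with the server's longer local computation.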
3.2.4 Frequency And Voltage Scaling
Since Synergy distributes the computation among peer devices, an alternative way to conserve energy
is to distribute the computation and run the peers at the maximum frequency instead of scaling the
frequencies. Since energy is the main concern, it is possible that because of the parallel speedup,
the decrease in the application’s execution time might be much more than the increase in power,
thereby conserving energy (refer equation 2.4). The design decision of frequency scaling was taken
because of the following reasons:
1. If the peers run at the maximum frequency with the data equally divided among all the peers
then the communication latency cannot be masked by computation. This design decision
(highlighted above) is one of the key design decisions of Synergy.
2. If the peers are run at the maximum frequency with unequal division of data, then the system
has to wait for the peer which has the maximum data. Since the system has to wait anyway,
it makes sense to scale the frequency in accordance with the computation that a peer has to
do.
In some cases it might be required that the devices available in the system be used for their
sheer compute power. This is true when the application is concerned only about the latency and
not about energy savings. For this purpose, Synergy provides a MAX-Distributed mode of operation.
The MAX-Distributed mode is a complementary mode to the distributed mode with the following
difference: all the nodes run at the maximum frequency and the data is equally divided.
3.2.5 Algorithm Design
This section describes the main algorithm using which Synergy decides the number of clients to be
chosen, the factor by which the core frequencies should be scaled and the amount of data to be
distributed. The pseudo code is given in Algorithm 1. For compute intensive applications, it can be
assumed that the time taken by the application is given as follows
\[ \mathit{Time}_{application} \propto \frac{\mathit{Data}}{\mathit{CoreFrequency}} \tag{3.1} \]
It is assumed that the value of Data is given by the application developer. This avoids the
need for Synergy to know the actual computation being performed. Synergy starts by scaling the
server frequency by one level. The ratio of the frequency step chosen to the maximum frequency
for the server determines the ratio of the amount of data kept by the server to the amount of data
distributed among the clients, i.e.
\[ \frac{\mathit{Frequency}_{step}}{\mathit{Frequency}_{max}} \propto \frac{\mathit{Data}_{Server}}{\mathit{Data}_{Clients}} \tag{3.2} \]
Synergy then iterates over another loop where the loop iterator (num_clients) runs from 1 to n-1
where n is the total number of devices in the system. For each value of the number of clients chosen,
the following equations are used to determine LatencyDiff and EnergyDiff .
\[ \mathit{LatencyDiff} = x \cdot \frac{D}{f} + \left( T_c + (1 - x) \cdot \frac{D}{\lceil f_{client} \rceil} \right) - \frac{D}{F_{max}} \tag{3.3} \]
and
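\[ E_{diff} = f^3 \cdot x \cdot \frac{D}{f} + \left( E_c + \frac{(1 - x) \cdot D}{\lceil f_{client} \rceil} \cdot \lceil f_{client} \rceil^3 \right) - \frac{D}{F_{max}} \cdot F_{max}^3 \tag{3.4} \]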
Where f is the current frequency step chosen by the server, x is the fraction of data kept
by the server (which is determined by equation 3.2), D is the total amount of data, Tc is the
time taken for communication, Ec is the energy taken for communication and fclient is the client
frequency determined by the value of data distributed among the clients. The values of Tc and Ec
are determined by the benchmark results shown in figure 3.2 and 3.1.
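The compute terms of equation 3.4 all share one shape, which the following sketch makes explicit. It assumes, as the cubic factors in equation 3.4 suggest, that the power drawn by a core grows with the cube of its frequency, and it reads the brackets around the client frequency as rounding up to the next available frequency step (an interpretation based on the description in section 5.2.3). The energy of a compute phase is its power multiplied by its run time from equation 3.1:

\[ E \propto P \cdot t \propto f^3 \cdot \frac{D}{f} = f^2 \cdot D \]

Under these assumptions equation 3.4 can be read as the energy of the server share, plus the communication energy and the energy of the client share, minus the energy of running everything locally at the maximum frequency:

\[ E_{diff} \propto x \cdot D \cdot f^2 + E_c + (1 - x) \cdot D \cdot \lceil f_{client} \rceil^2 - D \cdot F_{max}^2 \]

For this comparison to be meaningful, E_c has to be expressed in the same proportional units as the compute terms.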
Equation 3.3 helps minimize the latency when the application is run in the distributed mode;
it is derived with the help of the assumption stated in equation 3.1. Equation 3.4 is used to minimize
the energy. Equation 3.3 is considered satisfied if the value of LatencyDiff is less than a pre-defined
threshold (10% of the local run time by default). Equation 3.4 is considered satisfied if the value of
Ediff is less than zero. If both the equations are satisfied for a particular value of num_clients,
an entry is made in the decision table. The decision table stores the number of devices, the per
core frequency of each device, the value of LatencyDiff and the value of Ediff. Once the loop
has iterated over all the devices, the next lower frequency step of the server is chosen and the process is
repeated. The complexity of the algorithm is O(m · n), where m is the number of frequency steps
of the server and n is the number of clients available in the system. Once this step is complete, the
default action of Synergy is to choose the entry from the decision table which minimizes the
latency and maximizes the energy savings. If no such entry is available, Synergy decides to run the
application in local mode. The latency threshold (10% of the local run time by default) is
highly application dependent and is provided as an adjustable knob.
Algorithm 1 Algorithm Synergy uses to calculate data distribution and core frequency scaling
  Fmax ← Maximum frequency of the server
  n ← Total number of devices
  f ← Fstep1
  D ← Total data
  x ← Fraction of data kept by the server
  Tc ← Communication time
  Ec ← Communication energy
  for f ← Fstep1 to FlastStep do
    for num_clients ← 1 to n − 1 do
      x = f / Fmax
      fclient = (1 − x) · FmaxClient
      LatencyDiff = x · D / f + (Tc + (1 − x) · D / ⌈fclient⌉) − D / Fmax
      Ediff = f^3 · x · D / f + (Ec + ((1 − x) · D / ⌈fclient⌉) · ⌈fclient⌉^3) − (D / Fmax) · Fmax^3
      if LatencyDiff ≤ 0.1 · D / Fmax && Ediff < 0 then
        Decision Table ← num_devices
        Decision Table ← fclient
        Decision Table ← f
        Decision Table ← LatencyDiff
        Decision Table ← Ediff
      end if
    end for
  end for
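The first two assignments of the inner loop can be sketched in Python as follows. This is an illustration rather than Synergy's actual code: the function names are hypothetical, the frequency steps are the ones listed in table 5.1 (without the steps below 400MHz, which the evaluation did not use), and the client frequency is rounded up to the next available step, following the description in section 5.2.3.

```python
from bisect import bisect_left

# Frequency steps in MHz from table 5.1, excluding the steps below 400 MHz.
STEPS_MHZ = [400, 800, 1000, 1200, 1300, 1400]


def round_up_to_step(freq, steps):
    """Return the lowest available step that is >= freq (capped at the top step)."""
    ordered = sorted(steps)
    index = bisect_left(ordered, freq)
    return ordered[min(index, len(ordered) - 1)]


def split_for_server_step(f_server, server_steps, client_steps):
    """Apply x = f / Fmax and fclient = (1 - x) * FmaxClient from Algorithm 1."""
    f_max = max(server_steps)
    f_max_client = max(client_steps)
    x = f_server / f_max
    # Algebraically equal to (1 - x) * f_max_client; written this way so that
    # integer MHz values do not pick up floating point rounding noise.
    raw_client = (f_max - f_server) * f_max_client / f_max
    return x, round_up_to_step(raw_client, client_steps)


for f in (1300, 1000, 800):
    x, f_client = split_for_server_step(f, STEPS_MHZ, STEPS_MHZ)
    print(f"server at {f} MHz keeps {x:.0%} of the data; client runs at {f_client} MHz")
```

In this sketch a server step of 800MHz yields a client frequency of 800MHz as well, so the two devices together run at 1600MHz, which is more than the 1400MHz available locally. Rounding up to a frequency step is what makes such oversubscription possible.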
Chapter 4
Energy Conservation In Datacenters
This chapter describes a system that is designed for saving power in datacenter computers. Datacenter
computers are more amenable to hardware changes, hence several hardware innovations have
surfaced which decrease the amount of power consumed by a datacenter computing device. One of
these hardware innovations is connecting the voltage domains (cores, in the case of multi core chips) in a
series circuit arrangement instead of a parallel circuit. A series connection of voltage domains helps
save power during power delivery; however, software awareness is required to make such a hardware
technique work. This chapter describes the design of Equilibria, a system that runs on top
of series connected voltage domains and aids the hardware in saving power.
4.1 Why Series Connected Voltage Domains
The modern day multi-core chip is organized as several cores connected in a parallel circuit. As
such, connecting these cores in parallel ensures a common voltage supply to all the cores, hence
ensuring that the voltage draw by a core is never above the safe value. To introduce different
voltage levels, each core is connected to one or more voltage regulators (these regulators might be
present on chip for low latency [23]). The modern day operating system is equipped with several
frequency governors which scale the voltage of the core depending on the load [24–26]. The lower
voltage levels stress the power delivery circuits, which in turn leads to power losses due to inefficient
power delivery. Connecting the voltage domains in series solves this problem [13, 14]. Some of the
advantages of this approach are:
1. In such a design, the input voltage is high and the current is low. For example, a 64 core
server designed to work at 30W and 0.3V requires a power supply rated at 0.3V and 100A for
a parallel arrangement. The same server requires a power supply rated at 19.2V and 1.6A when
connected in series (refer figure 4.1). Such a power supply suffers from lower node-by-node
ripple and has better transient response.
Figure 4.1: Comparison of a 64 core, 30W server with cores connected in (a) Parallel and (b) Series. . .
2. The series arrangement also reduces the power lost in the power conversion stages. For example,
a typical conversion process may start with 120V at 60Hz and may step down to 1V, 50A
in three power conversion stages. Even if each stage is 90% efficient, only 73% of the incoming
power is delivered to the processor. In case of a series connection, one of these conversion
stages can be removed and another stage can be made substantially more efficient [12]. A
conversion process that includes one conversion stage with 90% efficiency followed by another
with 94% efficiency yields an overall efficiency of 85%, thereby cutting nearly half the power
lost by the three stage conversion process.
3. Sensitivity to power variations is greatly reduced in a series arrangement. Conventional processors
require a large number of decoupling capacitors to support extreme current slews. At
low voltages, capacitors store limited energy and do not perform the job well. In contrast, at
the higher voltages of a series connection, the current slew rates are far lower and the capacitor
energy storage is far higher [12].
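The figures quoted in the first two advantages can be reproduced with a few lines of Python. This is only a worked check of the arithmetic; the function names are illustrative and the calculation assumes ideal, identical cores.

```python
def supply_rating(cores, core_voltage, total_power, series):
    """Return the (volts, amps) a supply must deliver for the given arrangement."""
    # In series the core voltages add up; in parallel every core sees the supply voltage.
    volts = core_voltage * cores if series else core_voltage
    return volts, total_power / volts


def chain_efficiency(stage_efficiencies):
    """Overall efficiency of conversion stages connected one after another."""
    efficiency = 1.0
    for stage in stage_efficiencies:
        efficiency *= stage
    return efficiency


parallel = supply_rating(64, 0.3, 30.0, series=False)  # 0.3 V at about 100 A
series = supply_rating(64, 0.3, 30.0, series=True)     # 19.2 V at about 1.56 A
three_stage = chain_efficiency([0.9, 0.9, 0.9])        # about 0.73
two_stage = chain_efficiency([0.9, 0.94])              # about 0.85

print(parallel, series, three_stage, two_stage)
```

The three stage chain loses about 27% of the incoming power and the two stage chain about 15%, which is the basis for the statement that the loss is cut by nearly half.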
Figure 4.2: Portion of Raspberry Pi schematic
4.2 Base Power Measurements
4.2.1 Processor
For the software to make efficient scheduling decisions, the system requires accurate power mea-
surements of the processors. Moreover, the system of processors needs to be connected in a series
connection and hence requires hardware manipulation. The Raspberry-Pi [27] was chosen as the
base processor for this work, since none of the off-the-shelf chips have a series connected arrangement
nor do they lend themselves to easy manipulation of the circuit. However, the Raspberry-Pi allows
for relatively easy manipulation of the circuit.
4.2.2 Power Measurement Methods
This sub-section describes the methods used to establish a baseline power measurement for the
Raspberry Pi. These power measurements are used by Equilibria to make load balancing decisions.
The Raspberry Pi chip itself is a Broadcom BCM2835, containing an ARM1176JZF-S CPU, a
VideoCore IV GPU, and 512MB of integrated package-on-package (PoP) RAM. According to the schematic, the BCM2835
contains an integrated switch-mode DC/DC converter, which provides power to the core. Based on
measurements taken directly across the core input voltage (capacitor C16 from Figure 4.2) across
the full range of CPU load and frequency, the BCM2835’s integrated converter is configured to
output only two voltages: 1.227V and 1.337V. The voltage level transition from low to high occurs
(regardless of CPU load) as the processor frequency is increased beyond 700MHz.
The measurements of total input power, compared with power usage calculations of several major
components of the Raspberry Pi board indicate that the SoC itself consumes in some cases as little
as 60% of total input power. Thus, a precise characterization of the relationship between SoC power
usage and CPU load cannot be determined or even estimated by a measurement of total board input
power.
Figure 4.3: Power draw with frequency scaling for the Raspberry Pi (plot of power in watts against frequency in MHz)
Figure 4.4: Frequency and power ratio for the Raspberry-Pi (plot of the frequency ratio and the power ratio against the reading number)
The board is not designed for direct SoC power measurement. Hence, the layout was modified
slightly. Voltage measurements of the SoC were taken by soldering two zero-ohm leads across
capacitor C16 (refer figure 4.2), and utilizing a Fluke 45 bench meter. Current measurements
were taken by de-soldering one lead of inductor L5, and inserting zero-ohm leads to take a shunt
measurement of the current. This was accomplished with a separate Fluke 45 bench meter.
The results of the power measurement are shown in Figure 4.3. Each reading was taken 50 times
and the 95% confidence intervals are shown. As can be seen from the figure, there is a sudden
jump in the power draw when the frequency is increased beyond 700MHz. This is because when
the frequency increases beyond 700MHz, the input voltage to the core is increased from 1.227V to
1.337V. The rise is steep since the power consumed is directly proportional to the square of the
voltage. However, the Raspberry-Pi has only two voltage levels and beyond the first frequency step
(725MHz) the input voltage to the core does not increase any further. Hence the increase in the
power draw beyond this point is proportional to the increase in frequency. Figure 4.4 shows the
ratio of the frequency and power between readings n and n-1. It is evident from this graph that
beyond the first frequency step of 725MHz, the increase in the power draw is proportional to the
increase in the core frequency.
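The size of the jump can be estimated with the usual dynamic power model, in which power is proportional to the square of the voltage times the frequency. The sketch below is an estimate under that model only: it ignores static power, it does not reproduce the measured values of Figure 4.3, and the 750MHz step is a hypothetical example of a later frequency step.

```python
def relative_dynamic_power(v_old, v_new, f_old, f_new):
    """Ratio of new to old dynamic power, assuming P is proportional to V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Crossing 700 MHz raises both the core voltage and the frequency.
jump = relative_dynamic_power(1.227, 1.337, 700, 725)

# A later step at constant voltage (750 MHz is a hypothetical step).
later = relative_dynamic_power(1.337, 1.337, 725, 750)

print(f"700 -> 725 MHz: x{jump:.3f}, 725 -> 750 MHz: x{later:.3f}")
```

Under this model the first step raises the dynamic power by roughly 23%, most of which comes from the voltage term, while a later step at constant voltage raises it only in proportion to the frequency.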
4.3 Software Design For Series Connected Voltage Domains
The system consists of a master node and several client nodes. The master node is a dedicated
Raspberry-Pi responsible for making all the decisions for frequency scaling of the clients in the
system. The client nodes execute the commands sent by the master node. The nodes communicate
with each other using an off the shelf 8 port switch. The subsections below explain the roles and
functionalities of each of the node types in detail.
4.3.1 Master Node
The main responsibility of the master node is to continuously listen for updates from the client
nodes. The updates are the new CPU utilization values sent to the master by the clients. Once
the master receives the new CPU utilization value from a particular client, it proceeds to calculate
the new average CPU utilization of the system. If the increase or decrease in the average CPU
utilization of the system is more than 10% per client in the system, the master determines the new
frequency for all the clients in the system using equation 4.1. The master then calculates the closest
frequency step to the new frequency and sends it to all the client nodes. Sending the
new frequency and listening for the new CPU utilization from the clients take place in two parallel
threads. The flow chart in Figure 4.5 shows the master’s decision making process diagrammatically.
\[ \frac{\mathit{SystemFreq}_{new}}{\mathit{SystemFreq}_{old}} \propto \frac{\mathit{AvgCPUutilization}_{new}}{\mathit{AvgCPUutilization}_{old}} \tag{4.1} \]
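The master's decision can be sketched as follows. This is a minimal illustration, not Equilibria's code: the function names and the list of frequency steps are hypothetical, the proportionality constant in equation 4.1 is taken to be 1, and the 10% threshold is interpreted as percentage points of average CPU utilization.

```python
def closest_step(target, steps):
    """Return the available frequency step nearest to the target frequency."""
    return min(steps, key=lambda step: abs(step - target))


def master_update(freq_old, util_old, util_new, steps, threshold=10.0):
    """Return the step to send to all clients, or None if no change is needed."""
    if util_old <= 0:
        raise ValueError("the old average utilization must be positive")
    if abs(util_new - util_old) <= threshold:
        return None  # change too small to justify rescaling the whole system
    # Equation 4.1 with a proportionality constant of 1 (an assumption).
    target = freq_old * util_new / util_old
    return closest_step(target, steps)


steps_mhz = [700, 800, 900, 1000]  # hypothetical frequency steps
print(master_update(1000, 80.0, 56.0, steps_mhz))  # large drop in load
print(master_update(1000, 80.0, 75.0, steps_mhz))  # below the threshold
```

Because every client receives the same frequency, the clients stay balanced, which a series connection requires.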
4.3.2 Client Nodes
Each client node continuously monitors its own CPU utilization. If the node detects that its CPU
utilization has increased or decreased by more than 10%, it pushes a notification with the new CPU
utilization to the master. This threshold of 10% is used so that each client does not keep pushing
notifications to the master for small changes in the CPU utilization. This also helps decrease the
traffic on the network and utilize the network bandwidth effectively. Moreover, the master node
will not make a decision to change the frequency of all the clients in the system unless the CPU
utilization of one of the clients increases or decreases by more than 10%. Hence, it makes sense to
push notifications only when this condition is met.
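The notification rule of a client can be sketched as a small class. The class and method names are hypothetical, and the sketch makes one design choice that the text leaves open: the new reading is compared against the last value that was reported to the master rather than against the previous sample, so that a slow drift is eventually reported as well.

```python
class UtilizationReporter:
    """Decides when a client should push its CPU utilization to the master."""

    def __init__(self, initial_utilization, threshold=10.0):
        self.last_reported = initial_utilization
        self.threshold = threshold  # in percentage points of CPU utilization

    def observe(self, utilization):
        """Return the value to push to the master, or None to stay silent."""
        if abs(utilization - self.last_reported) > self.threshold:
            self.last_reported = utilization
            return utilization
        return None


reporter = UtilizationReporter(50.0)
for sample in (55.0, 61.0, 65.0, 50.0):
    print(sample, reporter.observe(sample))
```

Only the second and fourth samples differ from the last reported value by more than 10 points, so only those two are pushed to the master.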
The second job of the client is to listen for the updated frequency from the master. As soon as
the clients receive an updated frequency from the master, they check if their current frequency is
equal to the new desired frequency. If the current frequency is not equal to the desired frequency,
the clients update their frequency and send an acknowledgment to the master. Both these processes
of listening for a new frequency and monitoring the CPU utilization occur in two parallel threads.
The flow chart in Figure 4.6 shows the client’s decision making process diagrammatically.
Figure 4.5: Flow chart for the Master
Figure 4.6: Flow chart for the Clients
Chapter 5
Results
5.1 Power Savings In Datacenter Computers
A system of 3 Raspberry-Pis was networked using an off the shelf 8 port switch to test power savings
via series connected voltage domains using Equilibria. Figures 5.1, 5.2 and 5.3 show the graphs of the
power consumption of the system with varying average CPU utilization. The figures show simulation
results: the processors were not actually connected in a series circuit. Each graph depicts two lines
that assume power savings of 15% and 20% by the power delivery circuits. 15% and 20% are
conservative estimates of the power savings in a series connected set up of voltage domains; higher
power savings can be expected when more cores are connected in a series circuit [13,14].
Figure 5.1: Power consumption via Series connected voltage domains under low load (1 processor working; power in W against average CPU utilization in %, with lines for Non-series, Series-No-Reduction, Series-15%-Reduction and Series-20%-Reduction)
Figure 5.1 represents low load, with only 1 processor being able to handle all the tasks. In
this situation the non-series setup has the ability to switch off the other two Raspberry-Pis to save
power. This is however not possible in case of series connected voltage domains since the voltage
drop across each load is required to be the same. Figure 5.2 represents medium load with two
processors running. This graph shows better power savings for series connected voltage domains.
However, the best results from series connected voltage domains are obtained when the system is
under high load. These results are shown in figure 5.3. The amount of power savings achieved under
high load is the same as the amount of power savings made by the power delivery circuits which
could be anywhere from 15% to 30% depending on the number of loads connected in series.
Figure 5.2: Power consumption via Series connected voltage domains under moderate load (2 processors working; same axes and lines as Figure 5.1)
Figure 5.3: Power consumption via Series connected voltage domains under high load (3 processors working; same axes and lines as Figure 5.1)
5.2 Energy Savings On Mobile Devices
A prototype implementation of Synergy was developed for the Android operating system. Two
applications, image smoothing and video processing, were used for evaluating Synergy’s design.
The video processing application was a face recognition application for a video; it was
implemented using OpenCV. Integration of the OpenCV library with Synergy allows an application
developer to write image and video processing applications using the OpenCV library. Both these
applications are representative of a huge class of compute intensive media processing applications.
Two partition (and aggregation) functions are provided, one specialized for image partitioning and
the other for video partitioning. These functions take care of partitioning the dataset and
aggregating the results. The application developer can choose one of the two partition functions provided
with Synergy or provide her own partition and aggregation functions.
All the tests were carried out on Google Nexus S 4G phones running Android 2.2. Each test was
carried out 5 times and the graphs show the 95% confidence intervals. PowerTutor [21] was used
for energy consumption measurements. PowerTutor does not take voltage scaling into account. An
Android kernel which provides an interface to read the voltage values corresponding to the frequency
steps available was installed. Also, PowerTutor was modified to take voltage scaling into account in
its power readings. This only changed the power readings of the CPU. The voltage-frequency table
for the Nexus S 4G is shown in table 5.1. The device frequencies were not scaled below 400MHz
because below this value PowerTutor missed the power readings for some iterations. In other
words, PowerTutor’s readings were not accurate when the core frequency was scaled below 400MHz.
Table 5.1: Frequency-Voltage Table
Frequency (MHz)    Voltage (mV)
1400               1450
1300               1400
1200               1350
1000               1250
800                1200
400                1050
200                950
100                950
5.2.1 Image Processing
Image smoothing was implemented as a representative for image processing applications. The image
smoothing task smooths an image by averaging each pixel with the values of its neighboring pixels. The
averaging process is repeated several times in order to achieve a high level of smoothing. The
evaluations for image smoothing were performed for 3 different image sizes: 2MB, 4MB and 6MB.
The graphs for energy consumption vs. number of devices for each of the image sizes are shown
in figure 5.4. As can be seen, Synergy was able to conserve 26.9%, 30.6% and 29.6% of energy for
2MB, 4MB and 6MB image sizes respectively. Also, the corresponding latency hits (also shown in
graphs in the same figure) were within 5%. As the dataset size increases, the energy consumption
curve shows a plateau effect with an increase in the number of devices. This can be attributed to
the fact that energy savings are eventually limited by rising Wi-Fi communication costs.
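The smoothing task described at the start of this section can be illustrated with a short Python sketch. It is not the Android implementation that was evaluated: the 3x3 neighborhood, the treatment of the image borders and the function names are assumptions made for the example, and the image is a plain list of rows of grayscale values.

```python
def smooth_once(image):
    """Replace every pixel by the mean of itself and its in-bounds neighbors."""
    rows, cols = len(image), len(image[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += image[rr][cc]
                        count += 1
            result[r][c] = total / count
    return result


def smooth(image, passes):
    """Repeat the averaging pass to reach a higher level of smoothing."""
    for _ in range(passes):
        image = smooth_once(image)
    return image


spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
print(smooth(spike, 1))
```

Every output pixel depends only on a small neighborhood of the input, which is why the image can be partitioned among devices and the partial results aggregated afterwards.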
5.2.2 Video Processing
The video processing application recognizes faces in a video feed. This processing can be used to
recognize objects and landmarks among several other things, depending upon the training set of the
classifier. The processing was done offline because such a compute intensive task cannot be performed
online on a mobile device. An attempt was made to carry out the processing on a live video feed but
the jitter was intolerable. Videos of two sizes were processed: 25MB and 50MB. Figure 5.4 shows
the graph for energy consumption vs. number of devices. For the task of video processing, Synergy
always decided to run in local mode. The graphs show that the energy consumption in distributed
mode was higher than that in local mode, thereby supporting Synergy’s decision of running the
application in local mode. This can be seen for two, three and even four devices. Graphs shown
in figure 5.5 explain this behavior. These graphs show the split of energy consumption for clients,
server and the Wi-Fi subsystem. As can be seen, the Wi-Fi energy consumption is much higher
than the decrease in energy consumption due to frequency scaling. Due to this, the system is unable
to conserve energy in distributed mode.
For applications such as this, Synergy can be used in the MAX-Distributed mode. In this mode,
Synergy splits the data equally and runs each device in the system at its maximum frequency. The
energy consumption for this mode is also shown in figure 5.4. This mode is primarily for achieving
lower latencies compared to local mode and not for energy savings. As can be seen from the figures,
the energy consumption for this mode is much higher, but the system is able to reduce the latency
by 38.24% and 31.71% for 25MB and 50MB video processing, respectively.
Figure 5.4: Graphs for energy Consumption vs. Number of devices and Latency vs. Number of
devices, For Dataset sizes 2MB, 4MB, 6MB, 25MB and 50MB. (Panels: Energy Consumption (J) vs
Number of devices for 2MB, 4MB, 6MB, 25MB and 50MB; Latency (sec) vs Number of devices for 2MB,
4MB, 6MB and 25MB. Lines: Distributed Mode and Local Mode, plus MAX Distributed Mode for the
25MB and 50MB panels.)
Figure 5.5: Graphs for Energy Split Up, For Dataset sizes 2MB, 4MB, 6MB, 25MB and 50MB.
5.2.3 Latency-Energy Tradeoff
Figure 5.6 shows the latency penalty vs. energy savings. The latency penalty is negative for
a few configurations because Synergy oversubscribes the system. Because the
communication takes place over Wi-Fi, Synergy scales each device to the closest frequency step
higher than the calculated frequency. This results in a higher
cumulative frequency than the maximum frequency available to the device locally. It can be seen
from figure 5.6 that there exists an optimal point between latency penalty and energy savings.
Beyond this point, forcing Synergy to run in the distributed mode would result in diminishing
returns. The distributed mode would still be able to save energy but the latency hit would become
intolerable. This should be kept in mind when designing an application using Synergy. By default,
Synergy defines 10% as the latency penalty threshold but the application developer has the freedom
to alter this threshold.
5.3 Synergy’s Limitations
This section highlights Synergy’s limitations and presents some future directions which might help
minimize them.
Figure 5.6: Latency Penalty vs. Energy Savings (energy savings in % against latency penalty in %, for the 2MB, 4MB and 6MB datasets)
5.3.1 Parallelism
The pre-requisite for any application to expect energy savings from Synergy is that the application
should be parallelizable. A strictly sequential application will be run in local mode and will not
contribute to any energy savings. Moreover, the amount of parallelism available in the application
is also important. Since the parallel portions are distributed over a wireless network, it is required
that this distribution be done only when the application has a significant parallel portion. This is
because a significant amount of cost is paid in terms of energy for the distribution among peers. If
this cost is not amortized by the size of the parallel portion then Synergy will end up consuming
more energy due to the Wi-Fi energy outweighing the computation energy. To obviate this problem,
methods for an application developer to explicitly denote the parallel portion(s) of her applications
can be provided in the future versions of Synergy. Synergy can then use this information in the
decision making process of running the application in distributed or local mode.
5.3.2 Communication
Synergy cannot help an application conserve energy if the parallel portions of the application require
a significant amount of communication between each other. This is because the speedup obtained by
any application is also constrained by the communication required by its parallel portions. Moreover,
since communication has to take place over a wireless network, the communication latency is large.
Hence, for any application requiring a large amount of communication between its parallel portions (a
typical message passing application) Synergy will not only suffer from poor application performance
(in terms of latency) but will also be unable to conserve energy due to the frequent use of Wi-Fi.
A message passing model is not provided in the current implementation of Synergy. A limited amount
of message passing can still be achieved through Synergy but explicit message passing constructs
such as SEND and RECEIVE are not provided in the current implementation.
5.3.3 Data Privacy
Synergy can lead to data privacy concerns among its users since the data is now distributed among
peer devices. Even though the peer has only a part of the data, in some circumstances it is possible
to decipher the complete data with a small amount of information. Synergy provides an encryption
mechanism to limit this concern. Encryption, if used, limits the performance in terms of application
latency since computation on encrypted data is not straightforward. Stricter privacy and security
measures can be provided in future implementations of Synergy.
5.3.4 Non-CPU Intensive Applications
Synergy targets CPU intensive applications. Only applications that are compute intensive can gain
from Synergy. Applications such as a mail client, games with low CPU utilization, audio applications
etc. cannot hope to save energy using Synergy.
Chapter 6
Conclusion
This thesis presents two systems, Synergy and Equilibria, for energy conservation in mobile devices
and datacenter computers respectively.
Synergy is a middleware for distributing computation over mobile ad-hoc networks. This thesis
shows that in cases where the cloud is not available, mobile ad-hoc networks can be used as a
temporary alternative. Synergy is designed with energy conservation as the primary goal. This is
because in many circumstances when the cloud is not available, the battery life of mobile devices
becomes critical (disaster management, military groups etc.). Using Synergy, a considerable amount of
energy can be saved when mobile devices work together as one system rather than as individual devices.
Two compute intensive applications on a prototype implementation of Synergy were developed
to show the energy savings possible using such a system. Synergy also provides suggestions for
application developers to achieve better energy savings. Such a middleware for mobile ad-hoc
networks is not a substitute for clouds or cloudlets, but should coexist with them to improve the
quality of mobile computing.
Equilibria is a generic load balancer which runs on a network of series connected voltage domains.
This thesis shows that connecting voltage domains in series is a different approach to power
conservation compared to previous methods which rely on voltage and frequency scaling. Power
savings using Equilibria running on a network of Raspberry-Pis are shown. Series connected voltage
domains work best under high load and when several cores are connected in series. This thesis also
shows that Equilibria has the ability to handle variation in CPU utilization while protecting the
hardware in a series connection of loads.
References
[1] Todd Austin, Valeria Bertacco, David Blaauw, and Trevor Mudge. Opportunities and challenges
for better than worst-case design. In Proceedings of the 2005 Asia and South Pacific Design
Automation Conference, ASP-DAC ’05, pages 2–7, New York, NY, USA, 2005. ACM.
[2] Xinwen Zhang, Sangoh Jeong, Anugeetha Kunjithapatham, and Simon Gibbs. Towards an
elastic application model for augmenting computing capabilities of mobile platforms. In Mobile
Wireless Middleware, Operating Systems, and Applications, volume 48 of Lecture Notes of
the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering,
pages 161–174. Springer Berlin Heidelberg, 2010.
[3] Google Inc. Google goggles. http://www.google.com/mobile/goggles/.
[4] Google Inc. Google speech to text. https://www.google.com/speech-api/.
[5] H. Lagar-Cavilla, Niraj Tolia, Eyal de Lara, M. Satyanarayanan, and David O’Hallaron.
Interactive resource-intensive applications made easy. In Renato Cerqueira and Roy Campbell,
editors, Middleware 2007, volume 4834 of Lecture Notes in Computer Science, pages 143–163.
Springer Berlin / Heidelberg, 2007.
[6] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies. The case for vm-based cloudlets in
mobile computing. Pervasive Computing, IEEE, 8(4):14 –23, oct.-dec. 2009.
[7] Edwin Morris. A new approach for handheld devices in the military. 2011.
[8] Grace Lewis. Cloud computing for the battlefield. 2011.
[9] Power Sources Manufacturers’ Association (PSMA). Follow the power – a report on an energy
efficiency workshop from generator to integrated circuit. February 2007.
[10] 2011 ITRS process integration, devices, and structures.
http://www.itrs.net/Links/2011ITRS/2011Chapters/2011PIDS.pdf.
[11] T. Mudge. Power: a first-class architectural design constraint. Computer, 34(4):52 –58, apr
2001.
[12] P.S. Shenoy, I. Fedorov, T. Neyens, and P.T. Krein. Power delivery for series connected
voltage domains in digital circuits. In Energy Aware Computing (ICEAC), 2011 International
Conference on, pages 1–6, 30 2011-dec. 2 2011.
[13] S. Rajapandian, Zheng Xu, and K.L. Shepard. Implicit dc-dc downconversion through charge-
recycling. Solid-State Circuits, IEEE Journal of, 40(4):846 – 852, april 2005.
[14] S. Rajapandian, K.L. Shepard, P. Hazucha, and T. Karnik. High-voltage power delivery through
charge recycling. Solid-State Circuits, IEEE Journal of, 41(6):1400 – 1410, june 2006.
[15] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir,
and V. Narayanan. Leakage current: Moore’s law meets static power. Computer, 36(12):68–75,
dec. 2003.
[16] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug
Burger. Dark silicon and the end of multicore scaling. SIGARCH Comput. Archit. News,
39(3):365–376, June 2011.
[17] Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham, Conrad
Ziesler, David Blaauw, Todd Austin, Krisztian Flautner, and Trevor Mudge. Razor: A low-
power pipeline based on circuit-level timing speculation. In Proceedings of the 36th annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages 7–, Washington,
DC, USA, 2003. IEEE Computer Society.
[18] Rajamohana Hegde and Naresh R. Shanbhag. Energy-efficient signal processing via algorithmic
noise-tolerance. In Proceedings of the 1999 international symposium on Low power electronics
and design, ISLPED ’99, pages 30–35, New York, NY, USA, 1999. ACM.
[19] Naresh R. Shanbhag, Rami A. Abdallah, Rakesh Kumar, and Douglas L. Jones. Stochastic
computation. In Proceedings of the 47th Design Automation Conference, DAC ’10, pages 859–
864, New York, NY, USA, 2010. ACM.
[20] Qualcomm Inc. AllJoyn. https://www.alljoyn.org/.
[21] Lide Zhang, Birjodh Tiwana, Zhiyun Qian, Zhaoguang Wang, Robert P. Dick, Zhuoqing
Morley Mao, and Lei Yang. Accurate online power estimation and automatic battery behavior
based power model generation for smartphones. In Proceedings of the eighth IEEE/ACM/IFIP
international conference on Hardware/software codesign and system synthesis, CODES/ISSS
’10, pages 105–114, New York, NY, USA, 2010. ACM.
[22] Niranjan Balasubramanian, Aruna Balasubramanian, and Arun Venkataramani. Energy
consumption in mobile phones: a measurement study and implications for network applications. In
Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, IMC
’09, pages 280–293, New York, NY, USA, 2009. ACM.
[23] Intel Corp. Intel Core i7 architecture specifications. http://www.intel.com/products/
processor/corei7/.
[24] Akihiko Miyoshi, Charles Lefurgy, Eric Van Hensbergen, Ram Rajamony, and Raj Rajkumar.
Critical power slope: understanding the runtime effects of frequency scaling. In Proceedings
of the 16th international conference on Supercomputing, ICS ’02, pages 35–44, New York, NY,
USA, 2002. ACM.
[25] Bo Zhai, David Blaauw, Dennis Sylvester, and Krisztian Flautner. Theoretical and
practical limits of dynamic voltage scaling. In Proceedings of the 41st annual Design Automation
Conference, DAC ’04, pages 868–873, New York, NY, USA, 2004. ACM.
[26] Gang Qu. What is the limit of energy saving by dynamic voltage scaling? In Proceedings of
the 2001 IEEE/ACM international conference on Computer-aided design, ICCAD ’01, pages
560–563, Piscataway, NJ, USA, 2001. IEEE Press.
[27] The Raspberry Pi Foundation. Raspberry Pi. http://www.raspberrypi.org/.
