Determining application-specific peak power and energy requirements for ultra-low-power processors by Ye, Weidong
© 2017 Weidong Ye
DETERMINING APPLICATION-SPECIFIC PEAK POWER AND
ENERGY REQUIREMENTS FOR ULTRA-LOW-POWER
PROCESSORS
BY
WEIDONG YE
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Associate Professor Rakesh Kumar
ABSTRACT
Many emerging applications such as IoT, wearables, implantables, and sen-
sor networks are power- and energy-constrained. These applications rely
on ultra-low-power processors that have rapidly become the most abundant
type of processor manufactured today. In the ultra-low-power embedded
systems used by these applications, peak power and energy requirements are
the primary factors that determine critical system characteristics, such as
size, weight, cost, and lifetime. While the power and energy requirements
of these systems tend to be application-specific, conventional techniques for
rating peak power and energy cannot accurately bound the power and en-
ergy requirements of an application running on a processor, leading to over-
provisioning that increases system size and weight.
In this thesis, we present an automated technique that performs hardware-
software co-analysis of the application and ultra-low-power processor in an
embedded system to determine application-specific peak power and energy re-
quirements. Our technique provides more accurate, tighter bounds than con-
ventional techniques for determining peak power and energy requirements,
reporting 15% lower peak power and 17% lower peak energy, on average, than
a conventional approach based on profiling and guardbanding. Compared to
an aggressive stressmark-based approach, our technique reports power and
energy bounds that are each 26% lower, on average. Also,
unlike conventional approaches, our technique reports guaranteed bounds on
peak power and energy independent of an application’s input set. Tighter
bounds on peak power and energy can be exploited to reduce system size,
weight, and cost.
ACKNOWLEDGMENTS
This work was supported in part by NSF, SRC, and CFAR, within STARnet,
a Semiconductor Research Corporation program sponsored by MARCO
and DARPA. The author thanks anonymous reviewers and Professor Lizy
John for their suggestions and feedback, and Himanshu Shekhar Sahoo, who
performed testing of ULP processors for Chapter 2.
The text of this thesis is a reprint with permission of the material as it
appears in the proceedings of the 22nd ACM International Conference on
Architectural Support for Programming Languages and Operating Systems
(April 2017). The thesis author was a co-primary researcher and author, and
the co-authors involved in the submission directed the research which forms
this thesis.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 A CASE FOR APPLICATION-SPECIFIC INPUT-INDEPENDENT PEAK POWER AND ENERGY REQUIREMENTS
CHAPTER 3 APPLICATION-SPECIFIC INPUT-INDEPENDENT PEAK POWER AND ENERGY
3.1 Input-Independent Gate Activity Analysis
3.2 Input-Independent Peak Power Requirements
3.3 Input-Independent Peak Energy Requirements
3.4 Validation of X-based Analysis
3.5 Enabling Peak Power Optimizations
CHAPTER 4 METHODOLOGY
4.1 Simulation Infrastructure and Benchmarks
4.2 Baselines
CHAPTER 5 RESULTS
5.1 Optimizations
CHAPTER 6 GENERALITY AND LIMITATIONS
CHAPTER 7 RELATED WORK
CHAPTER 8 CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
Ultra-low-power (ULP) processors have rapidly become the most abundant
type of processor in production today. New and emerging power- and energy-
constrained applications such as the internet-of-things (IoT), wearables, im-
plantables, and sensor networks have already caused production of ULP pro-
cessors to exceed that of personal computers and mobile processors [1]. The
2015 ITRS report projects that these applications will continue to rely on
simple single-core ultra-low-power processors in the future, will be powered
by batteries and energy harvesting, and will have even tighter peak power
and energy constraints than the power- and energy-constrained ULP systems
of today [2]. Unsurprisingly, low-power microcontrollers and microprocessors
are projected to continue being the most widely-used type of processor in the
future [1, 3, 4, 5].
ULP systems can be classified into three types based on the way they are
powered [6]. As illustrated in Figure 1.1, some ULP systems are powered di-
rectly by energy harvesting (Type 1), while some are battery-powered (Type
3). Another variant is powered by a battery and uses energy harvesting to
charge the battery (Type 2).
Figure 1.1: ULP systems are commonly powered by energy harvesting,
battery, or a combination of the two, where harvesters are used to charge
the battery.
For each of the above classes, the sizes of energy harvesting and/or storage
components determine the form factor, size, and weight. Consider, for exam-
ple, the wireless sensor node shown in Figure 1.2 [7]. The two largest system
components that predominantly determine the overall system size and weight
are the energy harvester (solar cell) and the battery.
Going one step further, since the energy harvesting and storage require-
ments of a ULP system are determined by its power and energy requirements,
the peak power and energy requirements of a ULP system are the primary
factors that determine critical system characteristics such as size, weight,
cost, and lifetime [6]. In Type 1 systems, peak power is the primary con-
straint that determines system size, since the power delivered by harvesters
is proportional to their size. In these systems, harvesters must be sized to
provide enough power, even under peak load conditions. In Type 3 systems,
peak power largely determines battery life, since it determines the effective
battery capacity [8]. As the rate of discharge increases, effective battery ca-
pacity drops [8, 9]. This effect is particularly pronounced in ULP systems,
where near-peak power is consumed for a short period of time, followed by
a much longer period of low-power sleep, since pulsed loads with high peak
current reduce effective capacity even more drastically than sustained current
draw [9].
Figure 1.2: In most ULP systems, like this wireless sensor node, the size of
the battery and/or energy harvester dominates the total system size.
Figure 1.3: Harvester and battery size calculations for Type 1, 2, and 3
ULP systems depend on peak power and energy requirements.
In Type 2 and 3 systems, the peak energy requirement matters as well.
For example, energy harvesters in Type 2 systems must be able to harvest
more energy than the system consumes, on average. Similarly, battery life
and effective capacity are dependent on energy consumption (i.e., average
power) [9]. Figure 1.3 summarizes how peak power and energy requirements
impact sizing parameters for the different classes of ULP systems.
Finally, Tables 1.1 and 1.2 list the energy and power densities for different
types of batteries and energy harvesters, respectively. These data provide
a rough sense of how size and weight of a ULP system scale based on peak
energy and power requirements. A tighter bound on the peak power and
energy requirements of a ULP system can result in a roughly proportional
reduction in size and weight.
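As a rough illustration of how these densities translate into component size, the sketch below divides a peak power requirement by a harvester power density and an energy requirement by a battery energy density. The density values are copied from Tables 1.1 and 1.2; the function names, the 2 mW peak power, and the 15% tighter bound in the usage lines are hypothetical and chosen only for illustration. The calculation assumes ideal conversion and ignores the discharge-rate effects discussed above.

```python
# Densities copied from Tables 1.1 and 1.2; all other numbers are illustrative.
HARVESTER_POWER_DENSITY_MW_PER_CM2 = {
    "photovoltaic_sun": 100.0,
    "photovoltaic_indoor": 0.1,   # 100 uW/cm^2
    "thermoelectric": 0.06,       # 60 uW/cm^2
    "ambient_airflow": 1.0,
}
BATTERY_ENERGY_DENSITY_MJ_PER_L = {
    "li_ion": 1.152,
    "alkaline": 0.331,
    "carbon_zinc": 1.080,
    "ni_mh": 0.504,
    "ni_cad": 0.828,
    "lead_acid": 0.360,
}


def harvester_area_cm2(peak_power_mw, harvester_type):
    """Harvester area needed to deliver the peak power (ideal conversion)."""
    return peak_power_mw / HARVESTER_POWER_DENSITY_MW_PER_CM2[harvester_type]


def battery_volume_ml(energy_j, battery_type):
    """Battery volume needed to store the energy (1 MJ/L equals 1000 J/mL)."""
    joules_per_ml = BATTERY_ENERGY_DENSITY_MJ_PER_L[battery_type] * 1000.0
    return energy_j / joules_per_ml


if __name__ == "__main__":
    # Hypothetical 2 mW peak power bound versus a bound that is 15% tighter.
    loose = harvester_area_cm2(2.0, "photovoltaic_indoor")
    tight = harvester_area_cm2(2.0 * 0.85, "photovoltaic_indoor")
    print(f"harvester area: {loose:.1f} cm^2 -> {tight:.1f} cm^2")
```

Because both functions are simple ratios, any percentage reduction in the bound carries over directly to the estimated area or volume, which is the proportionality described in the paragraph above.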
How are Peak Power and Energy Determined Today?
There are several possible approaches to determine the peak power and
energy requirements of a ULP processor (Figure 1.4).¹ The most conservative
approach involves using the processor design specifications provided in data
sheets. These specifications characterize the peak power that can be
consumed by the hardware at a given operating point and can be directly
translated into a bound on peak power. This bound is conservative because it
is not application-specific; however, it is safe for any application that might
be executed on the hardware. A more aggressive technique for determining
peak power or energy requirements is to use a peak power or energy
stressmark. A stressmark is an application that attempts to activate the hardware
in a way that maximizes peak power or energy. A stressmark may be less
conservative than a design specification, since it may not be possible for an
application to exercise all parts of the hardware at once. The most
aggressive conventional technique for determining peak power or energy of a ULP
processor is to perform application profiling on the processor by measuring
power consumption while running the target application on the hardware.
However, since profiling is performed with specific input sets under specific
operating conditions, peak power or energy bounds determined by profiling
might be exceeded during operation if application inputs or system operating
conditions are different than during profiling. To ensure that the processor
operates within its peak power and energy bounds, a guardband is applied
to profiling-based results.
¹ Peak power and energy are sometimes referred to as worst-case power and energy.
Table 1.1: Specific energy and energy density for different battery
types [10].
Battery Type    Specific Energy [J/g]    Energy Density [MJ/L]
Li-ion          460                      1.152
Alkaline        400                      0.331
Carbon-zinc     130                      1.080
Ni-MH           340                      0.504
Ni-cad          140                      0.828
Lead-acid       146                      0.360
Table 1.2: Power density for different types of energy harvesters [11].
Harvester Type           Power Density
Photovoltaic (sun)       100 mW/cm²
Photovoltaic (indoor)    100 µW/cm²
Thermoelectric           60 µW/cm²
Ambient airflow          1 mW/cm²
Most ULP embedded systems run the same application or computation
over and over in a compute / sleep cycle for the entire lifetime of the sys-
tem [12]. As such, the power and energy requirements of embedded ULP
Figure 1.4: The conventional methodology for sizing energy harvesting and
storage components involves determining peak power and energy
requirements for a processor and selecting components that will provide
enough power and energy to satisfy the requirements over the lifetime of
the system.
[Figure 1.5 panels: (a) Active gates at the peak cycle for tHold (452 gates); (b) Active gates at the peak cycle for PI (743 gates). Labeled modules: MULT, WDG, REGISTER FILE, FRONTEND, MEM_BACKBONE, ALU, MISC.]
Figure 1.5: Different applications can have different activity profiles,
resulting in peak power and energy requirements that are
application-specific.
processors tend to be application-specific. This is not surprising, considering
that different applications exercise different hardware components at different
times, generating different application-specific loads and power profiles. For
example, Figure 1.5 shows the active (toggling) gates for two different appli-
cations (tHold and PI – see Table 4.1) during the cycles in which peak power
is expended for each application. These activity profiles were generated by
running gate-level simulations of the applications on openMSP430 [13] and
marking all gates that toggled in the cycle in which each benchmark expended
its peak power. The profiles show that PI exercises a larger fraction of the
processor than tHold at its peak, leading to higher peak power. However,
while the peak power and energy requirements of ULP processors tend to
be application-specific, many conventional techniques for determining peak
power and energy requirements for a processor are not application-specific
(e.g., design-based and stressmark-based techniques). Even in the case of
a profiling-based technique, guardbands must be used to inflate the peak
power requirements observed during profiling, since it is not possible to gen-
erate bounds that are guaranteed for all possible input sets. These limita-
tions prevent existing techniques from accurately bounding the power and
energy requirements of an application running on a processor, leading to
over-provisioning that increases system size and weight.
In this thesis, we present a novel technique that determines application-
specific peak power and energy requirements based on hardware-software
co-analysis of the application and ultra-low-power processor in an embedded
system. Our technique performs a symbolic simulation of an application on
the processor netlist in which unknown logic values (Xs) are propagated for
application inputs.² This allows us to identify gates that are guaranteed to
not be exercised by the application for any input. This, in turn, allows us to
bound the peak power and energy requirements for the application. The peak
power and energy requirements generated by our technique are guaranteed to
be safe for all possible inputs and operating conditions. Our technique is fully
automated and provides more accurate, tighter bounds than conventional
techniques for determining peak power and energy requirements. This thesis
makes the following contributions.
² Peak power and energy analyses can be offered as a cloud compilation service by the
hardware system vendor in settings where the application developer does not have access
to the processor description [14, 15, 16].
• We present an automated technique based on symbolic simulation that
takes an embedded system’s application software and processor netlist as in-
puts and determines application-specific peak power and energy requirements
for the processor that are guaranteed to be valid for all possible application
inputs and operating conditions. This is the first approach to use symbolic
simulation to determine peak power and energy requirements for an applica-
tion running on a processor.
• We show that the application-specific peak power and energy requirements
determined by our technique are more accurate, and therefore less conser-
vative, than those determined by conventional techniques. On average, the
peak power requirements generated by our technique are 27%, 26%, and 15%
lower than those generated based on design specifications, a stressmark, and
profiling, respectively, and the peak energy requirements generated by our
technique are 47%, 26%, and 17% lower. Reduction in the peak power and
energy requirements of a ULP processor can be leveraged to improve critical
system metrics such as size and weight.
• Our technique can be used to guide optimizations that target and reduce
the peak power of a processor. Optimizations suggested by our technique
reduce peak power by up to 10% for a set of embedded applications.
CHAPTER 2
A CASE FOR APPLICATION-SPECIFIC
INPUT-INDEPENDENT PEAK POWER
AND ENERGY REQUIREMENTS
We measured peak power consumption for a sample set of ULP benchmark
applications (see Table 4.1) running on an MSP430F1610 processor.¹ Benchmark
applications were run repeatedly with different inputs at an operating
frequency of 8 MHz while sampling the voltage and current of the processor
at a rate of 10 MHz using an InfiniiVision DSO-X 2024A oscilloscope, to
ensure at least one sample per cycle. Power is calculated as the product of
voltage and current. Figure 2.1 shows our test setup.
Figure 2.2 compares the peak power observed for different applications.
The results show that peak power can be different for different applications.
Thus, peak power bounds that are not application-specific will overestimate
the peak power requirements of applications, leading to over-provisioning
of energy harvesting and storage components that determine system size
and weight. Figure 2.2 also shows that the peak power requirements of
applications are significantly lower than the rated peak power of the chip (4.8
mW), so using design specifications to determine peak power requirements
can lead to significant over-provisioning and inefficiency. The figure also
confirms that peak power of an application depends on application inputs
and can vary significantly for different inputs. This means that profiling
cannot be relied on to accurately determine the peak power requirement
for a processor, since not all input combinations can be profiled, and the
peak power for an unprofiled input could be significantly higher than the
peak power observed during profiling. Since input-induced variations change
peak power by over 25% for these applications (Figure 2.2), a profiling-based
approach for determining peak power requirements should apply a guardband
of at least 25% to the peak power observed during profiling.
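The profiling-plus-guardband baseline described above can be sketched as follows. Instantaneous power is taken as the product of the sampled voltage and current, the largest value over all profiled runs is the observed peak, and the 25% guardband inflates it. The function names and the data layout (one pair of voltage and current sample lists per run) are assumptions made for this sketch and are not part of the measurement setup.

```python
def profiled_peak_power_mw(voltage_v, current_ma):
    """Peak instantaneous power of one run, with P = V * I per sample.

    Volts times milliamps gives milliwatts.
    """
    if len(voltage_v) != len(current_ma):
        raise ValueError("voltage and current traces must have equal length")
    return max(v * i for v, i in zip(voltage_v, current_ma))


def guardbanded_bound_mw(runs, guardband=0.25):
    """Inflate the highest peak observed across all profiled runs.

    `runs` is an iterable of (voltage_v, current_ma) sample lists, one pair
    per profiled input set (a hypothetical layout for this sketch).
    """
    observed_peak = max(profiled_peak_power_mw(v, i) for v, i in runs)
    return observed_peak * (1.0 + guardband)


if __name__ == "__main__":
    # Two made-up runs of the same application with different inputs.
    runs = [
        ([3.0, 3.0], [0.5, 0.6]),
        ([3.0, 3.0], [0.4, 0.7]),
    ]
    print(f"guardbanded bound: {guardbanded_bound_mw(runs):.3f} mW")
```

Note that the result is only as trustworthy as the chosen guardband: an unprofiled input whose peak exceeds the observed peak by more than 25% would still violate the bound.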
For energy-constrained ULP systems, like those powered by batteries (Type
2 and 3), peak energy as well as peak power determines the size of energy
¹ MSP430 is one of the most popular processors used in ULP systems [17].
Figure 2.1: The test setup used to measure peak and average power on a
ULP processor (MSP430).
[Figure 2.2 panels: (a) Peak Power, y-axis Peak Power [mW]; (b) NPE, y-axis Normalized Peak Energy [J/cycle] (×10^-10). x-axis for both panels: Benchmarks (autoCorr, binSearch, FFT, intFilt, mult, PI, tea8, tHold).]
Figure 2.2: The peak power and normalized peak energy (normalized to an
application’s runtime in cycles) of a ULP processor are different for
different applications and different inputs. The bars represent average
across all inputs; error bars show the range of input-induced peak and
average power variations. Measured variation between multiple runs of the
same application and same input is less than 2%.
[Figure 2.3 plot: x-axis Time [seconds] (×10^-3), from 0 to 1; y-axis Power [mW], from 0 to 2.5.]
Figure 2.3: Measured instantaneous power of MSP430F1610 for the mult
benchmark is significantly lower, on average, than both the rated and
observed peak power for the application.
harvesting and storage components (Chapter 1). Thus, it is also important
to determine an accurate bound on the peak energy requirements of a ULP
processor. Figure 2.3 shows the instantaneous power profile for an appli-
cation (mult), demonstrating that on average, instantaneous power can be
significantly lower than peak power. Therefore, we can more accurately de-
termine the optimal sizing of components in an energy-constrained system
by generating an accurate bound on peak energy, rather than conservatively
multiplying peak power by execution time.
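The gap between the two energy estimates can be made concrete with a short sketch. It sums a sampled power trace over time (a rectangle-rule approximation) and compares the result with the conservative estimate of peak power multiplied by execution time. The function names and the sample values are hypothetical, and the sketch integrates a single measured trace; it is not the input-independent peak energy bound developed in the next chapter.

```python
def energy_from_trace_j(power_w, sample_period_s):
    """Energy of a sampled power trace, using a rectangle-rule sum."""
    return sum(power_w) * sample_period_s


def conservative_energy_j(power_w, sample_period_s):
    """Energy estimate that assumes peak power for the whole execution."""
    execution_time_s = len(power_w) * sample_period_s
    return max(power_w) * execution_time_s


if __name__ == "__main__":
    # Made-up trace: mostly 1 mW with a single 2 mW peak, sampled at 10 MHz.
    trace_w = [1e-3, 2e-3, 1e-3, 1e-3]
    dt_s = 1e-7
    print(f"integrated:   {energy_from_trace_j(trace_w, dt_s):.2e} J")
    print(f"peak x time:  {conservative_energy_j(trace_w, dt_s):.2e} J")
```

The integrated value can never exceed the peak-times-time estimate, and the two coincide only when the trace is flat, which is why a trace with short peaks leaves so much room for a tighter energy bound.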
Figure 2.2 characterizes the peak energy, normalized to application runtime
in cycles, for different applications and input sets, showing that the maximum
rate at which an application can consume energy is also application- and
input-dependent. Therefore, conventional techniques for determining the
peak energy requirements of a ULP processor have the same limitations as
conventional techniques for determining peak power requirements. In both
cases, the limitations of conventional techniques require over-provisioning
that can substantially increase system size and weight.
In the next chapter, we describe a novel technique for determining the peak
power and peak energy requirements of a ULP processor that is application-
specific yet also input-independent.
CHAPTER 3
APPLICATION-SPECIFIC
INPUT-INDEPENDENT PEAK POWER
AND ENERGY
Figure 3.1 provides an overview of our technique for determining application-
specific peak power and energy requirements that are input-independent.
The inputs to our technique are the application binary that runs on a ULP
processor and the gate-level netlist of the ULP processor. The first phase
of our technique, described in Section 3.1, is an activity analysis that uses
symbolic simulation to efficiently characterize all possible gates that can be
exercised for all possible execution paths of the application and all possible
inputs. This analysis also reveals which gates can never be exercised by
the application. Based on this analysis, we perform input-independent peak
power (Section 3.2) and energy (Section 3.3) calculations to determine the
peak power and energy requirements for a ULP processor.
3.1 Input-Independent Gate Activity Analysis
Since the peak power and energy requirements of an application can vary
based on application inputs, a technique that determines application-specific
peak power requirements must bound peak power for all possible inputs. Ex-
haustive profiling for all possible inputs is not possible for most applications,
so we have created a novel approach for activity analysis that uses unknown
logic values (Xs) for inputs to efficiently characterize activity for all possible
inputs with minimum simulation effort.
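To see why a single simulation with Xs can cover every input, consider how an unknown value behaves at individual gates. The following is a minimal sketch of three-valued gate evaluation in the style of conventional logic simulators; the function names are hypothetical and this is not the simulator used in this thesis.

```python
X = "X"  # unknown logic value: could be 0 or 1 depending on the input


def and3(a, b):
    """AND over {0, 1, X}: a controlling 0 masks an unknown input."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def or3(a, b):
    """OR over {0, 1, X}: a controlling 1 masks an unknown input."""
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return X


def not3(a):
    """NOT over {0, 1, X}: an unknown input stays unknown."""
    return X if a == X else 1 - a


if __name__ == "__main__":
    # A gate held at 0 by the application stays known for every input value.
    print(and3(0, X), and3(1, X), or3(1, X), not3(X))
```

When a value fixed by the application controls a gate, the output is known regardless of the X on the other input, so the gate can be shown not to depend on application inputs. An X propagates only where the output really could differ between inputs.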
Our technique, described in Algorithm 1, is based on symbolic simula-
tion [18] of an application binary running on the gate-level netlist of a pro-
cessor, in which Xs are propagated for all signal values that cannot be con-
strained based on the application. When the simulation begins, the states of
all gates and memory locations that are not explicitly loaded with the binary
are initialized to Xs. During simulation, all input values are replaced with Xs
by our simulator.
[Figure 3.1 diagram: the design binary and netlist (.v), with Xs as inputs, feed Gate Activity Analysis, which produces a Symbolic Execution Tree; Peak Power/Energy Calculation then yields the App-specific Peak Power/Energy Requirement]
Figure 3.1: Our technique performs input-independent activity analysis
that enables determination of accurate peak power and energy requirements
for a ULP processor.
As simulation progresses, the simulator dynamically constructs
an execution tree describing all possible execution paths through the
application. If an X symbol propagates to the inputs of the program counter
(PC) during simulation, indicating an input-dependent control sequence, a
branch is created in the execution tree. Normally, the simulator pushes the
state corresponding to one execution path onto a stack for later analysis
and continues down the other path. However, a path is not pushed to the
stack or re-simulated if it has already been simulated (i.e., if the simulator
has seen the branch (PC) before and the processor state is the same as it
was when the branch was previously encountered). This allows Algorithm 1
to analyze programs with input-dependent loops. When simulation down
one path reaches the end of the application, an un-simulated state is loaded
from the last input-dependent branch in depth-first order, and simulation
continues. When all execution paths have been simulated to the end of the
application (i.e., depth-first traversal of the control flow graph terminates),
activity analysis is complete.1
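The exploration strategy can be sketched on a toy model. The following Python sketch is illustrative only and is not our simulator's implementation: it abstracts the full processor state to a hashable label, and the `successors` table, the `explore` function, and the state names are hypothetical stand-ins for gate-level simulation. A state with exactly one successor models a cycle in which the next PC is known, several successors model an X reaching the PC inputs, and no successors marks the end of the application.

```python
def explore(successors, root):
    """Depth-first exploration in the spirit of Algorithm 1 (toy model)."""
    seen = {root}    # branch targets already scheduled, so they are not re-simulated
    stack = [root]   # un-processed execution paths
    visited = []     # simulated states, in simulation order
    while stack:
        state = stack.pop()
        visited.append(state)
        nxt = successors[state]
        while len(nxt) == 1:       # next PC is known: keep simulating this path
            state = nxt[0]
            visited.append(state)
            nxt = successors[state]
        for target in nxt:         # empty at program end, several entries on an X
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return visited


# Hypothetical program containing an input-dependent loop.
toy = {
    "start": ["loop_head"],
    "loop_head": ["body", "exit"],  # branch on an unknown (X) condition
    "body": ["loop_head"],
    "exit": [],
}
order = explore(toy, "start")
```

In this toy run the loop body is simulated once. When simulation returns to `loop_head`, both branch targets have already been seen, so nothing new is pushed and the traversal terminates.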
During symbolic simulation, the simulator captures the activity of each
gate at each point in the execution tree. A gate is considered active if its
value changes or if it has an unknown value (X) and is driven by an active
gate; otherwise, the gate is idle. The resulting annotated symbolic execution
tree describes all possible instances in which a gate could possibly toggle for
1Complex applications and processors might require heuristics for exploration of a large
number of execution paths [19, 20]; however, our approach is adequate for ULP systems,
which tend to have simple processors and applications. For example, complete analysis of
our most complex benchmark takes 2 hours.
Algorithm 1 Input-independent Gate Activity Analysis
1. Procedure Create_Symbolic_Execution_Tree(app_binary, design_netlist)
2.   Initialize all memory cells and all gates in design_netlist to X
3.   Load app_binary into program memory
4.   Propagate reset signal
5.   s ← State at start of app_binary
6.   Symbolic Execution Tree T.set_root(s)
7.   Stack of un-processed execution paths, U.push(s)
8.   while U != ∅ do
9.     e ← U.pop()
10.    while e.PC_next != X and !e.END do
11.      e.set_inputs_X() // set all peripheral port inputs to Xs
12.      e′ ← propagate_gate_values(e) // simulate this cycle
13.      e.annotate_gate_activity(e,e′) // annotate activity in tree
14.      e.add_next_state(e′) // add to execution tree
15.      e ← e′ // process next cycle
16.    end while
17.    if e.PC_next == X then
18.      for all a ∈ possible_PC_next_vals(e) do
19.        if a ∉ T then
20.          e′ ← e.update_PC_next(a)
21.          U.push(e′)
22.          T.insert(a)
23.        end if
24.      end for
25.    end if
26.  end while
all possible executions of the application binary. As such, a gate that is not
marked as toggled at a particular location in the execution tree can never
toggle at that location in the application. As described in the next sections,
we can use the information gathered during activity analysis to bound the
peak power and energy requirements of an application.
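The X-propagation and activity rules above can be illustrated with a minimal three-valued logic sketch. This is an assumption-laden illustration, not our simulator: the function names (`and3`, `or3`, `not3`, `is_active`) are hypothetical, and only two-input AND/OR and an inverter are modeled rather than a full standard cell library.

```python
X = "X"  # unknown logic value


def and3(a, b):
    """Three-valued AND: a controlling 0 masks an unknown input."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def or3(a, b):
    """Three-valued OR: a controlling 1 masks an unknown input."""
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return X


def not3(a):
    """Three-valued inverter: an unknown input stays unknown."""
    return X if a == X else 1 - a


def is_active(prev, cur, driven_by_active_gate):
    """A gate is active if its value changes, or if it holds an X
    and is driven by an active gate; otherwise it is idle."""
    if prev != cur:
        return True
    return cur == X and driven_by_active_gate
```

For instance, `and3(0, X)` evaluates to 0, so an X on one input does not mark the gate as potentially toggling while the other input holds a controlling value. This is how the analysis can show that some gates are never exercised by an application.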
3.2 Input-Independent Peak Power Requirements
The input to the second phase of our technique is the symbolic execution
tree generated by input-independent gate activity analysis. Algorithm 2
describes how to use the activity-annotated execution tree to generate peak
power requirements for a (ULP processor, application) pair.
The first step in determining peak power from an execution tree produced
during gate activity analysis is to concatenate the execution paths in the
execution tree into a single execution trace. We use a value change dump
(VCD) file to record the gate-level activity in the execution trace. The exe-
cution trace contains Xs, and the goal of the peak power computation is to
assign values to the Xs in the way that maximizes power for each cycle in
the execution trace. The power of a gate in a particular cycle is maximized
Algorithm 2 Input-independent Peak Power Computation
1. Procedure Calculate_Peak_Power
2.   {E|O}_VCD ← Open {Even|Odd} VCD File // maximizes peak power in even|odd cycles
3.   T ← flatten(Execution_Tree) // create a flattened execution trace that represents the execution tree
4.   for all {even|odd} cycles c ∈ T do
5.     for all toggled gates g ∈ c do
6.       if value(g,c) == X && value(g,c-1) == X then
7.         value(g,c-1) ← maxTransition(g,1) // returns the value of the gate in the first cycle of the gate's maximum power transition
8.         value(g,c) ← maxTransition(g,2) // returns the value of the gate in the second cycle of the gate's maximum power transition
9.       else if value(g,c) == X then
10.        value(g,c) ← !value(g,c-1)
11.      else if value(g,c-1) == X then
12.        value(g,c-1) ← !value(g,c)
13.      end if
14.    end for
15.    {E|O}_VCD ← value(*,c-1)
16.    {E|O}_VCD ← value(*,c)
17.  end for
18.  Perform power analysis using E_VCD and O_VCD to generate even and odd power traces, P_E and P_O
19.  Interleave even cycle power from P_E with odd cycle power from P_O to form peak power trace, P_peak
20.  peak_power ← max(P_peak)
when the gate transitions (toggles). Since a transition involves two cycles,
maximizing dynamic power in a particular cycle, c, of the execution trace
involves assigning values to any Xs in the activity profiles of the current and
previous cycles, c and c− 1, to maximize the number of transitions in cycle
c.
The number and power of transitions are maximized as follows. When the
output value of a gate in only one of the cycles, c or c − 1, is an X, the
X is assigned the value that assumes that a transition happened in cycle c.
When both values are Xs, the values are assigned to produce the transition
that maximizes power in cycle c. The maximum power transition is found
by a look-up into the standard cell library for the gate. Since constraining
Xs in two consecutive cycles to maximize power in the second cycle may not
maximize power in the first cycle, we produce two separate VCD files – one
that maximizes power in all even cycles and one that maximizes power in all
odd cycles. To find the peak power of the application, we first run activity-
based power analysis on the design using the even and odd VCD files to
generate even and odd power traces. We then form a peak power trace by
interleaving the power values from the even cycles in the even power trace
and the odd cycles in the odd power trace. This peak power trace bounds
the peak power that is possible in every cycle of the execution trace. The
peak power requirement of the application is the maximum per-cycle power
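Activity profile with Xs:
cycle 1 2 3 4 5 6 7 8 9
g1    0 0 1 X X X 0 0 0
g2    0 X X X X X X 0 0
g3    0 0 0 1 X X X X 0
Assignment maximizing power in even cycles (left):
cycle 1 2 3 4 5 6 7 8 9
g1    0 0 1 0 0 1 1 0 0
g2    0 1 0 1 0 1 1 0 0
g3    0 0 0 1 0 1 0 1 0
Assignment maximizing power in odd cycles (right):
cycle 1 2 3 4 5 6 7 8 9
g1    0 0 1 0 1 1 0 0 0
g2    0 0 1 0 1 0 1 0 0
g3    0 0 0 1 0 0 1 1 0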
Figure 3.2: To determine a bound on peak power, we generate two different
activity profiles – one that maximizes power in even cycles (left) and one
that maximizes power in odd cycles (right).
value found in the peak power trace.2
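The interleaving step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our tool flow: the function name `peak_power_trace` is hypothetical, the two input lists stand in for the power traces produced by power analysis of the even and odd VCD files, and the numbers are invented.

```python
def peak_power_trace(even_trace, odd_trace):
    """Take even-cycle power from the even trace and odd-cycle power
    from the odd trace. List index 0 corresponds to cycle 1."""
    if len(even_trace) != len(odd_trace):
        raise ValueError("traces must cover the same cycles")
    trace = []
    for i, (p_even, p_odd) in enumerate(zip(even_trace, odd_trace)):
        cycle = i + 1
        trace.append(p_even if cycle % 2 == 0 else p_odd)
    return trace


# Hypothetical per-cycle power values in mW for cycles 1 to 4.
p_even = [1.0, 2.2, 1.1, 2.0]   # trace that maximizes even cycles
p_odd = [1.5, 1.0, 1.9, 1.2]    # trace that maximizes odd cycles
p_peak = peak_power_trace(p_even, p_odd)
peak_power = max(p_peak)
```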
Our VCD generation technique is illustrated in Figure 3.2. We use the
example of three gates with overlapping Xs that need to be assigned to max-
imize power in every cycle. We show two assignments – one that maximizes
peak power in all even cycles (left), and one that maximizes peak power
in all odd cycles (right). Assuming, for the sake of example, that all gates
have equal power consumption and that the 0→ 1 transition consumes more
power than the 1 → 0 transition for these gates, the highest possible peak
power for this example happens in cycle 6 in the “even” activity trace, when
all the gates have a 0→ 1 transition.
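The X-assignment rules of Algorithm 2 can be sketched for a single gate as follows. This is a simplified illustration, not our implementation: the function name `assign_for_parity` is hypothetical, every cycle pair containing an X is treated as a candidate for a toggle (the activity annotations of the execution tree are not modeled), and the maximum-power transition defaults to 0 → 1, matching the assumption made in the example above instead of a standard cell library look-up.

```python
X = "X"  # unknown logic value


def assign_for_parity(trace, parity, max_transition=(0, 1)):
    """Resolve Xs so that transitions are maximized in even cycles
    (parity=0) or odd cycles (parity=1). Index 0 is cycle 1."""
    vals = list(trace)
    for c in range(2, len(vals) + 1):   # cycle 1 has no predecessor
        if c % 2 != parity:
            continue
        prev, cur = c - 2, c - 1        # list indices of cycles c-1 and c
        if vals[cur] == X and vals[prev] == X:
            vals[prev], vals[cur] = max_transition
        elif vals[cur] == X:
            vals[cur] = 1 - vals[prev]  # force a toggle in cycle c
        elif vals[prev] == X:
            vals[prev] = 1 - vals[cur]  # force a toggle in cycle c
    return vals


# Activity profiles of gates g2 and g3 from Figure 3.2.
g2 = [0, X, X, X, X, X, X, 0, 0]
g3 = [0, 0, 0, 1, X, X, X, X, 0]
g2_even = assign_for_parity(g2, 0)
g2_odd = assign_for_parity(g2, 1)
```

Applied to the g2 and g3 profiles of Figure 3.2, the sketch yields the even and odd assignments shown in the figure for those gates.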
3.3 Input-Independent Peak Energy Requirements
Our technique generates a per-cycle peak power trace characterizing all pos-
sible execution paths of an application. The peak power trace can be used to
generate peak energy requirements. Figure 3.3 shows per-cycle peak power
traces sampled from our benchmark applications. Since per-cycle peak power
varies significantly over the compute phases of an application, peak energy
2It is possible that glitching between clock edges can impact the power profile for an
application. This impact can be accounted for by Primetime’s power analysis [21].
can be significantly lower than the value obtained by assuming maximum peak power in every cycle (i.e.,
peak power × clock period × number of cycles). Instead, the peak energy of
an application is bounded by the execution path with the highest sum of per-
cycle peak power multiplied by the clock period. To avoid enumerating all
execution paths, we use several techniques. For an input-dependent branch,
peak energy is computed by selecting the branch path with higher energy.
For a loop whose number of iterations is input-independent, peak energy can
be computed as the peak energy of one iteration multiplied by the number of
iterations. For cases where the number of iterations is input-dependent, the
maximum number of iterations may be determined either by static analysis
or user input (as suggested by prior work [22]).3 If neither is possible, it
may not be possible to compute the peak energy of the application; however,
this is uncommon in embedded applications [12].
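The rules above for branches and loops can be sketched as a recursive computation over a structured trace. The sketch is illustrative and rests on assumptions: the node encoding (`"cycles"`, `"seq"`, `"branch"`, `"loop"`), the function names, and the power values are hypothetical, and the 10 ns clock period corresponds to a 100 MHz operating point.

```python
def peak_power_sum(node):
    """Sum of per-cycle peak power (W) along the highest-energy path."""
    kind = node[0]
    if kind == "cycles":    # straight-line code: list of per-cycle peak power
        return sum(node[1])
    if kind == "seq":       # sequence of sub-traces
        return sum(peak_power_sum(child) for child in node[1])
    if kind == "branch":    # input-dependent branch: take the costlier path
        return max(peak_power_sum(child) for child in node[1])
    if kind == "loop":      # bounded loop: one iteration times the bound
        body, iterations = node[1], node[2]
        return iterations * peak_power_sum(body)
    raise ValueError(f"unknown node kind: {kind}")


def peak_energy(node, clock_period_s):
    """Peak energy bound in joules."""
    return peak_power_sum(node) * clock_period_s


# Hypothetical application: two cycles, a branch, then a 4-iteration loop.
program = ("seq", [
    ("cycles", [1.0e-3, 2.0e-3]),
    ("branch", [("cycles", [1.0e-3]), ("cycles", [2.0e-3, 1.0e-3])]),
    ("loop", ("cycles", [1.5e-3]), 4),
])
CLOCK_PERIOD_S = 10e-9
bound = peak_energy(program, CLOCK_PERIOD_S)
naive = 2.0e-3 * CLOCK_PERIOD_S * 8   # peak power in all 8 cycles of the longest path
```

In this invented example the path-based bound is lower than the naive product of peak power, clock period, and cycle count, because only one cycle actually reaches the 2 mW peak in most program phases.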
3.4 Validation of X-based Analysis
To demonstrate that our symbolic execution-based (X-based) activity anal-
ysis marks all gates that could possibly be toggled by an application for
all possible inputs, we performed a validation check by comparing the sets
of gates toggled by input-based simulations for several different input sets
against the set of gates marked as potentially-toggled by symbolic simula-
tion. Figure 3.4 illustrates this comparison for two input-based simulations
of the mult benchmark with different input sets – those that have the lowest
and highest number of toggled gates. In the figure, toggled gates common
to X-based and input-based simulation are shown as Xs, and gates that are
exclusively marked by symbolic simulation as potentially-toggled are shown
as blue triangles. As expected, no gate is exclusively marked by input-based
simulation. Our validation results show that all the gates toggled by input-
based simulation are also marked as potentially-toggled by X-based symbolic
simulation, validating the correctness of our approach for characterizing tog-
gle activity.
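The first validation check amounts to a set comparison, sketched below. The gate names and sets are invented for illustration, and `exclusive_to_input_based` is a hypothetical helper, not part of our tool.

```python
def exclusive_to_input_based(x_marked, input_toggled):
    """Gates toggled by an input-based run but not marked by X-based
    analysis. A non-empty result would indicate a missed gate."""
    return set(input_toggled) - set(x_marked)


# Hypothetical gate sets.
x_marked = {"alu/g12", "mult/g3", "mult/g9", "frontend/g7"}
low_activity_run = {"alu/g12", "frontend/g7"}
high_activity_run = {"alu/g12", "mult/g3", "frontend/g7"}

violations = [exclusive_to_input_based(x_marked, run)
              for run in (low_activity_run, high_activity_run)]
x_only = x_marked - high_activity_run   # potentially-toggled, unseen in this run
```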
We perform a second validation of our technique by comparing the peak
power traces generated for benchmarks by our technique against power traces
3The number of loop iterations is bounded for all evaluated benchmarks. In general,
applications with unbounded runtimes are uncommon in embedded domains.
[Figure 3.3 plot: per-cycle peak power traces, Peak Power [W] (×10^-3) versus Time [s] (×10^-3), one panel each for autoCorr, binSearch, FFT, intFilt, mult, PI, tea8, tHold, div, inSort, rle, intAvg, ConvEn, and Viterbi]
Figure 3.3: The per-cycle peak power varies significantly over the course of
an application, showing that the worst-case average power can be
significantly lower than peak power. Therefore, the peak energy can be
significantly lower than the product of peak power and application runtime
would suggest.
[Figure 3.4 plot: two panels showing toggled gates over the processor regions MULT, WDG, REGISTER FILE, FRONTEND, MEM_BACKBONE, ALU, and MISC]
Figure 3.4: Toggled gates for mult with low-activity inputs (top) and
high-activity inputs (bottom), compared against potentially-toggled gates
identified by X-based analysis. X-based simulation marks all gates that can
potentially toggle for an application for all possible inputs. This set of gates
(unique x ∪ common) is a superset of the gates that toggle during an
input-based application execution (common).
[Figure 3.5 plot: Power [mW] versus Time [seconds] (×10^-6); series: input-based, X-based]
Figure 3.5: The X-based peak power trace generated by our technique for
an application provides an upper bound on all possible input-based power
traces for the application. (Result shown for mult.)
generated by input-based execution of the benchmarks. The validation re-
sults confirm that our peak power trace always provides an upper bound on
the power of any input-based power trace. Figure 3.5 shows an example;
the X-based peak power trace for the mult application is always higher than
the input-based power trace. These validation results also show that the
X-based peak power trace closely matches the input-based trace, indicating
that the peak power and energy requirements generated by our technique are
not overly conservative.
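The second validation check can be expressed as a per-cycle comparison of two traces. The sketch below is illustrative: the function names are hypothetical and the trace values are invented, not measured data.

```python
def is_upper_bound(x_trace, input_trace):
    """True if the X-based trace is at least the input-based trace in every cycle."""
    return (len(x_trace) == len(input_trace)
            and all(x >= p for x, p in zip(x_trace, input_trace)))


def mean_margin(x_trace, input_trace):
    """Average per-cycle gap between the bound and the input-based trace."""
    return sum(x - p for x, p in zip(x_trace, input_trace)) / len(x_trace)


# Hypothetical per-cycle power values in mW.
x_based = [1.6, 2.1, 1.8]
input_based = [1.5, 2.0, 1.8]
bounded = is_upper_bound(x_based, input_based)
margin = mean_margin(x_based, input_based)
```

A small mean margin together with a passing bound check corresponds to a bound that is both safe and not overly conservative.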
3.5 Enabling Peak Power Optimizations
Since our technique is able to associate the input-independent peak power
consumption of a processor with the particular instructions that are in the
pipeline during a spike in peak power, we can use our tool to identify which
instructions or instruction sequences cause spikes in peak power. Our tech-
nique can also provide a power breakdown that shows the power consump-
tion of the microarchitectural modules that are exercised by the instructions.
These analyses can be combined to identify which instructions executing in
which modules cause power spikes. After identifying the cause of a spike,
we can use software optimizations to target the instruction sequences that
cause peaks and replace them with alternative sequences that generate less
instantaneous activity and power while maintaining the same functionality.
After optimizing software to reduce a spike in peak power, we can re-run
our peak power analysis technique to determine the impact of optimizations
on peak power. Guided by our technique, we can choose to apply only the
optimizations that are guaranteed to reduce peak power.
[Figure 3.6 plot: Power [mW] versus Cycle Number (cycles 142 to 152), annotated with the instructions in the Frontend and Execute/Mem pipeline stages, with per-module power breakdowns at COI 146 and COI 150; modules: clk_module, dbg, exec_unit, frontend, mem_backbone, multiplier, sfr, watchdog]
Figure 3.6: A snapshot of instantaneous power profiles for mult at two
different COIs where peaks occur. Our technique analyzes the instructions
in the pipeline (top) to find each COI’s culprit instructions that cause the
peak power in each pipeline stage along with the per-module peak power
breakdown (bottom) to identify which instructions in which
microarchitectural modules are responsible for a peak.
Figure 3.6 shows an example where our technique identifies peak power
spikes in cycles 146 and 150. Our technique also reports the instructions
in each stage of the pipeline during those cycles of interest (COIs), as well
as the per-module power breakdown for those cycles, which identifies the
modules that are consuming the most power. This information can be used
to guide optimizations that replace the instructions with different instruction
sequences that induce less activity and power in the modules that consume
the most power. Since software optimizations can impact performance as
well as peak power, we will discuss optimizations that reduce peak power
and their impact on performance and energy in Section 5.1.
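Identifying cycles of interest and the modules that dominate them can be sketched as follows. This is a minimal illustration, assuming a per-cycle, per-module power breakdown is available as a dictionary. The function name `cycles_of_interest` is hypothetical and all power values are invented: only the cycle numbers and module names echo Figure 3.6, and the sketch makes no claim about which modules actually dominate there.

```python
def cycles_of_interest(per_module_power, k=2):
    """Return the k highest-power cycles, each paired with the module
    that consumes the most power in that cycle."""
    totals = {cycle: sum(mods.values())
              for cycle, mods in per_module_power.items()}
    top = sorted(totals, key=totals.get, reverse=True)[:k]
    report = []
    for cycle in sorted(top):
        mods = per_module_power[cycle]
        report.append((cycle, max(mods, key=mods.get)))
    return report


# Hypothetical breakdown in mW.
breakdown = {
    145: {"exec_unit": 0.5, "frontend": 0.4, "mem_backbone": 0.5},
    146: {"exec_unit": 0.6, "frontend": 0.5, "mem_backbone": 1.1},
    149: {"exec_unit": 0.6, "frontend": 0.4, "mem_backbone": 0.5},
    150: {"exec_unit": 1.0, "frontend": 0.5, "mem_backbone": 0.6},
}
cois = cycles_of_interest(breakdown)
```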
CHAPTER 4
METHODOLOGY
4.1 Simulation Infrastructure and Benchmarks
We verify our technique on a silicon-proven processor – openMSP430 [13], an
open-source version of one of the most popular ULP processors [17]. The pro-
cessor is synthesized, placed, and routed in TSMC 65GP technology (65nm)
for an operating point of 1V and 100 MHz using Synopsys Design Com-
piler [23] and Cadence EDI System [24]. Gate-level simulations are per-
formed by running full benchmark applications on the placed and routed
processor using a custom gate-level simulator that efficiently traverses the
control flow graph of an application and captures input-independent activ-
ity profiles (Chapter 3). We show results for all benchmarks from [25] and
all EEMBC benchmarks that fit in the program memory of the processor.
These benchmarks are chosen to be representative of emerging ultra-low-
power application domains such as wearables, internet of things, and sensor
networks [25]. The IPC of these benchmarks on our processor varies from
1.25 to 1.39, with an average of 1.29. Power analysis is performed using
Synopsys Primetime [21]. Experiments were performed on a server housing
two Intel Xeon E-2640 processors (8 cores each, 2 GHz operating frequency,
64 GB RAM).
Chapter 2 shows measured data for an MSP430F1610 processor that demon-
strate that different applications have different peak power and energy re-
quirements, and the requirements of an application can vary significantly for
different inputs. The results motivate an application-specific input-independent
technique for determining the peak power and energy requirements for ULP
processors. For the results in Chapter 5, we perform evaluations on the
open-source openMSP430 processor [13]. Figure 4.1a and Figure 4.1b confirm
that the peak power and energy requirements of openMSP430 also depend
[Figure 4.1 plots: (a) Peak Power [mW] and (b) Normalized Peak Energy (NPE) [J/cycle] (×10^-11), each shown per benchmark: autoCorr, binSearch, FFT, intFilt, mult, PI, tea8, tHold, div, inSort, rle, intAVG, ConvEn, Viterbi]
Figure 4.1: Different applications and different input sets for the same
application have different peak power and peak energy requirements.
(Results for openMSP430.)
on the application and application inputs. Note that the results in Figure 2.2
and Figure 4.1 differ because they are for different implementations of the
MSP430 architecture (MSP430F1610 and openMSP430), with different process
technology (130 nm vs 65 nm) and operating frequencies (8 MHz vs 100
MHz).
4.2 Baselines
For baselines, we compare against conventional techniques for determining
the peak power and energy requirements of processors. An overview of the
baseline techniques can be found in Figure 1.4. The design specification-based
baseline (design tool) is determined by performing power and energy
analysis of the design using the default input toggle rate used by our design
tools [21]. The stressmark-based baselines (GB input-based) use stressmarks
that target peak instantaneous power and average power. Kim et al.
used a genetic algorithm to automatically generate stressmarks that target
maximum di/dt-induced voltage droop for a microprocessor [26]. We modified
their framework to generate stressmarks that target peak instantaneous
power and average power for openMSP430. The profiling-based baseline
Table 4.1: Benchmarks
Embedded Sensor Benchmarks [25]: mult, binSearch, tea8, intFilt, tHold, div, inSort, rle, intAVG
EEMBC Embedded Benchmarks [12]: Autocorr, FFT, ConvEn, Viterbi
Control Systems Benchmark: Proportional Integral Controller (PI)
(input-based) is generated by performing input-based power and energy pro-
filing for several input sets and applying a guardbanding factor of 4/3 to the
peak power and energy observed during profiling. The guardbanding factor
is the same as in prior studies [27, 28] and is appropriate for the input-
dependent peak power variability exhibited by our benchmarks in Table 4.1
(Figure 2.2).
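The profiling-based baseline reduces to a one-line computation, sketched here with invented numbers. The function name `guardbanded_requirement` and the observed peak values are hypothetical. Only the 4/3 guardbanding factor comes from the text above.

```python
GUARDBAND = 4 / 3   # guardbanding factor used for the profiling-based baseline


def guardbanded_requirement(observed_peaks, guardband=GUARDBAND):
    """Scale the highest peak observed across profiled input sets."""
    return guardband * max(observed_peaks)


# Hypothetical peak power (mW) observed for three input sets of one application.
observed = [1.80, 1.95, 2.10]
requirement = guardbanded_requirement(observed)
```

Unlike a bound derived from symbolic simulation, the result depends on which input sets happen to be profiled, which is why a guardband is applied at all.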
CHAPTER 5
RESULTS
We use our technique described in Chapter 3 to determine peak power and
energy requirements for a ULP processor for different benchmark applica-
tions. Figure 5.1 compares the peak power requirements reported by our
technique against the conventional techniques for determining peak power
requirements, described in Section 4.2. The results show that the peak power
requirements reported by our X-based technique are higher than the highest
input-based application-specific peak power for all applications, confirming
that our technique provides a bound on peak power. The results also show
that our technique provides the most accurate bound on peak power, com-
pared to conventional techniques for determining peak power requirements.
For example, the peak power requirements reported by our technique are
only 1% higher than the highest observed input-based peak power for the
benchmark applications, on average. Other techniques for determining peak
power and energy requirements are significantly less accurate, which can
lead to inefficiency in critical system parameters such as size and weight (see
Chapter 1).
Our technique is more accurate than application-oblivious techniques such
as determining peak power requirements from a stressmark or design speci-
fication, because an application constrains which parts of the processor can
be exercised in a particular cycle. Our technique also provides a more ac-
curate bound than a guardbanded input-based peak power requirement, be-
cause it does not require a guardband to account for the non-determinism
of input-based profiling (shown in Figure 5.1 as error bars). By accounting
for all possible inputs using symbolic simulation, our technique can bound
peak power and energy for all possible application executions without guard-
banding. The peak power requirements reported by our technique are 15%
lower than guardbanded application-specific requirements, 26% lower than
guardbanded stressmark-based requirements, and 27% lower than design
[Figure 5.1 plot: Peak Power [mW] per benchmark (autoCorr, binSearch, FFT, intFilt, mult, PI, tea8, tHold, div, inSort, rle, intAVG, ConvEn, Viterbi), plus stressmark and design_tool; legend: design tool, input-based, GB input-based, X-based]
Figure 5.1: Our X-based technique for determining peak power
requirements provides the most accurate (least conservative) guaranteed
bound on peak power.
specification-based requirements, on average.
Since our technique is application-specific and does not require guard-
bands, one question is, “Why is the bound provided by X-based analysis
more conservative for some applications than others?” The answer is that
since X-based analysis provides a bound on power for all possible inputs,
it becomes more conservative when there is greater possibility for input-
dependent variation in power. For example, the multiplier is a relatively
large, high-power module, with high potential for input-dependent variation
in power consumption. For some inputs (e.g., X ∗ 0), power consumed by
the multiplier is minimal, since there are no partial products to compute.
For other inputs (e.g., two very large numbers), the power consumed by the
multiplier is much larger. Since our symbolic simulation technique assumes
Xs for inputs, we always assume the highest possible power for a multiply in-
struction. Therefore, X-based peak power requirements for applications that
contain a large number of multiplications may be more conservative than
X-based requirements for other applications.
Conversely, the tea8 application, which performs encryption, only uses low-
power ALU modules – shift register and XOR – that have significantly less
potential for input-induced power variation. As a result, X-based analysis
closely matches input-based profiling results for this application. For all ap-
plications, even those with more potential for input-induced power variation,
our X-based analysis technique provides a peak power bound that is more
accurate than those provided by conventional techniques.
Our technique also provides more accurate bounds on peak energy than
conventional techniques, partly because of the reasons mentioned above, and
also because our technique is able to characterize the peak energy consump-
tion in each cycle of execution, generating a peak energy trace that accounts
for dynamic variations in energy consumption. Using a design specification
to determine peak energy is particularly inaccurate, since it does not con-
sider dynamic variations in the energy requirements of an application. The
guardbanded input-based technique, which does consider dynamic variations,
provides a more accurate peak energy bound than the design specification for
all benchmarks. However, it does not always provide a more accurate bound
than the design specification for peak power, since peak power is an instanta-
neous phenomenon that is less dependent on dynamic variations. Figure 5.2
presents peak energy of different benchmarks, normalized to application run-
time in cycles, i.e., peak average power, which characterizes the maximum
rate at which the application can consume energy. In Figure 5.2, the peak en-
ergy requirements reported by our technique are 17% lower than guardbanded
application-specific requirements, 26% lower than guardbanded stressmark-
based requirements, and 47% lower than design specification-based require-
ments, on average. As expected, application-specific normalized peak energy
(Figure 5.2) varies less than peak power (Figure 5.1), since peak energy
characterizes average peak power over the entire execution of an application,
whereas peak power corresponds to one instant in the application’s execution.
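The relationship between normalized peak energy and peak average power can be written out explicitly. The symbols below are notation introduced here for illustration: $E_{\mathrm{peak}}$ is the peak energy of the application, $N_{\mathrm{cycles}}$ its runtime in cycles, and $T_{\mathrm{clk}} = 1/f_{\mathrm{clk}}$ the clock period.

```latex
\[
E_{\mathrm{norm}} = \frac{E_{\mathrm{peak}}}{N_{\mathrm{cycles}}}
\ \left[\mathrm{J/cycle}\right],
\qquad
\overline{P}_{\mathrm{peak}}
  = \frac{E_{\mathrm{peak}}}{N_{\mathrm{cycles}}\, T_{\mathrm{clk}}}
  = E_{\mathrm{norm}}\, f_{\mathrm{clk}}
\]
```

As an illustrative value only, a normalized peak energy of $1.5 \times 10^{-11}$ J/cycle at the 100 MHz operating point corresponds to a peak average power of $1.5 \times 10^{-11} \times 10^{8} = 1.5$ mW.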
As described in Chapter 1, more accurate peak power and energy require-
ments can be leveraged to reduce critical ULP system parameters like size
and weight. For example, reduction in a Type 1 system’s peak power require-
ments allows a smaller energy harvester to be used. System size is roughly
proportional to harvester size in Type 1 systems. In Type 2 systems, it is
the peak energy requirement that determines the harvester size; reduction in
[Figure 5.2 plot: Normalized Peak Energy [J/cycle] (×10^-11) per benchmark (autoCorr, binSearch, FFT, intFilt, mult, PI, tea8, tHold, div, inSort, rle, intAVG, ConvEn, Viterbi), plus stressmark and design_tool; legend: design tool, input-based, GB input-based, X-based]
Figure 5.2: Our X-based technique for determining peak energy
requirement (normalized to application run-time in cycles, i.e., the peak
average power) is more accurate than existing conventional techniques.
Table 5.1: Percentage reduction in harvester area compared to different
baseline techniques, averaged over all benchmarks, for different percentage
contributions of the processor peak power to the system peak power.
Baseline      10%    25%    50%    75%    90%    100%
GB-Input      1.49   3.73   7.47   11.21  13.45  14.94
GB-Stress     2.60   6.47   12.95  19.42  23.31  25.90
Design Tool   2.68   6.70   13.41  20.12  24.14  26.82
peak energy requirement reduces system size roughly proportionally. Since
required battery capacity depends on a system’s peak energy requirement,
and effective battery capacity depends on the peak power requirement, re-
ductions in peak power and energy requirements both reduce battery size for
Type 2 and 3 systems.
A ULP system may contain other components, such as transmitter/receiver,
ADC, DAC, and sensor(s), along with the processor. All of these components
may contribute to the system’s peak power and energy, and hence, the sizing
of the harvester and battery. Tables 5.1 and 5.2 show the percentage
reduction in the harvester size and battery size, respectively, from our
technique for different fractions representing the processor’s contribution
to the system’s peak power and energy. For a real system such as the one
shown in Figure 1.2, which has a harvester area of 32.6 cm² and a battery
volume of 6.95 mm³, the area reduction of the harvester is 4.87, 8.44, or
8.75 cm² relative to a system designed using guardbanded input-based
profiling, a guardbanded stressmark, or the design tool, respectively, to
estimate the peak power of the processor. Similarly, the volume reduction
of the battery is 0.42, 0.63, or 1.12 mm³, respectively.¹ As expected,
savings from our technique are higher when the processor is the dominant
consumer of power and energy in the overall system.²
¹ The battery is a thin film battery of dimensions 5.7 mm × 6.1 mm × 200 µm
(area of 34.7 mm²). Assuming the height of the battery does not change, the
corresponding savings in battery area are 6.07, 9.01, and 16.18 mm²,
respectively.
² ITRS 2015 projections show that the microcontroller will be the dominant
consumer of power in future IoT and IoE systems [2].
Table 5.2: Percentage reduction in battery volume compared to different
baseline techniques, averaged over all benchmarks, for different percentage
contributions of the processor energy to the overall energy of the system.
Baseline       10%    25%    50%    75%    90%   100%
GB-Input      1.74   4.37   8.74  13.11  15.73  17.48
GB-Stress     2.59   6.49  12.98  19.48  23.37  25.97
Design Tool   4.66  11.66  23.32  34.98  41.97  46.64
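The rows of Table 5.1 scale almost linearly with the processor's share of system peak power. The sketch below reproduces that scaling under the assumption that the other components' contributions are unchanged, so the system-level reduction is the processor-level reduction weighted by the processor's share. The function names are illustrative; the numeric inputs are the 100% column of Table 5.1 and the 32.6 cm² harvester area quoted above.

```python
def system_reduction_pct(processor_reduction_pct, processor_share):
    """System-level reduction (%) when only the processor's share shrinks.

    Assumes the remaining components contribute the same peak as before.
    """
    return processor_reduction_pct * processor_share


def absolute_saving(total, reduction_pct):
    """Convert a percentage reduction into an absolute saving."""
    return total * reduction_pct / 100.0


GB_INPUT_AT_100 = 14.94    # Table 5.1, GB-Input row, 100% column
HARVESTER_AREA_CM2 = 32.6  # harvester area of the example system

for share in (0.10, 0.25, 0.50, 0.75, 0.90, 1.00):
    pct = system_reduction_pct(GB_INPUT_AT_100, share)
    print(f"processor share {share:.0%}: {pct:.2f}% smaller harvester")

saving = absolute_saving(HARVESTER_AREA_CM2, GB_INPUT_AT_100)
print(f"absolute saving at 100% share: {saving:.2f} cm^2")
```

Small differences from the tabulated values in the last digit are expected, since the table entries were presumably computed from unrounded data.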
5.1 Optimizations
As discussed in Section 3.5, our technique can be used to guide application-
level optimizations that reduce peak power. Here, we discuss three software
optimizations, suggested by our technique, that we applied to the bench-
mark applications to reduce peak power. The optimizations were derived by
analyzing the processor’s behavior during the cycles of peak power consump-
tion. This analysis involves (a) identifying instructions in the pipeline at the
peak, and (b) identifying the power contributions of the microarchitectural
modules to the peak power to determine which modules contribute the most.
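The two analysis steps can be sketched as follows, assuming a per-cycle, per-module power breakdown and a record of the instruction in flight are available from the analysis. The data layout, module names, and numbers are hypothetical and are not the format produced by the tool described in this thesis.

```python
def analyze_peak(per_cycle_module_power, pipeline_contents):
    """Locate the peak cycle, the instruction in flight, and the ranked modules.

    per_cycle_module_power: list of {module_name: power_mw}, one dict per cycle.
    pipeline_contents: list of instruction strings, one per cycle.
    """
    totals = [sum(modules.values()) for modules in per_cycle_module_power]
    peak_cycle = max(range(len(totals)), key=totals.__getitem__)
    ranked = sorted(
        per_cycle_module_power[peak_cycle].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return peak_cycle, pipeline_contents[peak_cycle], ranked


# Hypothetical three-cycle window around a peak.
power = [
    {"frontend": 0.4, "execution_unit": 0.5, "mem_backbone": 0.3},
    {"frontend": 0.4, "execution_unit": 0.6, "mem_backbone": 0.3},
    {"frontend": 0.4, "execution_unit": 0.7, "mem_backbone": 0.9},
]
instructions = ["mov #0, r9", "add r9, r10", "pop r2"]

cycle, instruction, ranked_modules = analyze_peak(power, instructions)
print(cycle, instruction, ranked_modules)
```

The instruction and the top-ranked module together indicate which transformation is worth trying at that point in the program.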
The first optimization aims to reduce a peak by “spreading out” the power
consumed in a peak cycle over multiple cycles. This is accomplished by
replacing a complex instruction that induces a lot of activity in one cycle
with a sequence of simpler instructions that spread the activity out over
several cycles.
The second optimization aims to reduce the instantaneous activity in a
peak cycle by delaying the activation of one or more modules, previously
activated in a peak cycle, until a later cycle. For this optimization, we focus
on the POP instruction, since it generates peaks in some benchmarks. The
peaks occur because a POP instruction generates high activity on the data
and address buses and simultaneously uses the incrementer logic to update
the stack pointer. To reduce the peak, we break down the POP instruction
into two instructions – one that moves data from the stack, and one that
increments the stack pointer.
The third optimization is based on the observation that for some applica-
tions, peak power is caused by the multiplier (a high-power peripheral mod-
ule) being active simultaneously with the processor core. To reduce peak
power in such scenarios, we insert a NOP into the pipeline during the cycle in
which the multiplier is active.
(a) OPT 1
    before:
        mov &0x013a, r15;
        pop r2;
    after:
        mov &0x013a, r15
        mov #0, r9
        mov @r1, r2
        add #2, r15
(b) OPT 2
    before:
        mov &0x013a, r15;
        pop r2;
    after:
        mov &0x013a, r15
        mov #0, r9
        mov @r1, r2
        add #2, r1
(c) OPT 3
    before:
        mov -6(r4), &0x0132
        mov -4(r4), &0x0138
        mov 0x013a, r15
    after:
        mov -6(r4), &0x0132
        mov -4(r4), &0x0138
        nop
        mov 0x013a, r15
Figure 5.3: Instruction optimization transforms.
The three optimizations we applied to our benchmarks to reduce peak
power are summarized below. The optimizations are shown in Figure 5.3.
• Register-Indexed Loads (OPT 1): A load instruction (MOV) that references
the memory by computing the address as an offset to a register’s value
involves several micro-operations – source address generation, source
read, and execute. Breaking the micro-operations into separate instructions
can reduce the instantaneous power of the load instruction. The ISA already
provides a register indirect load operation where the value of the register
is directly used as the memory address instead of as an offset. Using
another instruction (such as an ADD or SUB), we can compute the correct
address and store it into another register. We then use the second register
to execute the load in register indirect mode.
• POP instructions (OPT 2): The micro-operations of a POP instruction
are (a) read value from address pointed to by the stack pointer, and (b)
increment the stack pointer by two. POP is emulated using MOV @SP+, dst.
This can be broken down to two instructions – MOV @SP, dst and ADD #2, SP.
• Multiply (OPT 3): The multiplier is a peripheral in openMSP430. Data
is MOVed to the inputs of the multiplier and then the output is MOVed back
to the processor. For a 2-cycle multiplier, all moving of data can be done
consecutively without any waiting. However, this involves a high power draw,
since there will be a cycle when both the multiplier and the processor are
active. This can be avoided by adding a NOP between writing to and reading
from the multiplier.
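As an illustration only, OPT 2 and OPT 3 can be expressed as simple peephole rewrites over a list of assembly strings. This is a sketch, not the flow used to produce the results in this thesis. It assumes that r1 is the stack pointer, that a write to &0x0138 is the last operand write to the multiplier, and that a read from 0x013a, 0x013c, or 0x013e fetches a multiplier result; the two addresses 0x0138 and 0x013a are taken from Figure 5.3, and everything else (function names, regular expressions) is hypothetical.

```python
import re

POP_RE = re.compile(r"^\s*pop\s+(\w+)\s*$", re.IGNORECASE)
# Assumed: destination &0x0138 is the multiplier's second operand register.
OP2_WRITE_RE = re.compile(r",\s*&0x0138\s*$", re.IGNORECASE)
# Assumed: sources 0x013a/0x013c/0x013e are multiplier result registers.
RESULT_READ_RE = re.compile(r"^\s*mov\s+&?0x013[ace]\s*,", re.IGNORECASE)


def split_pop(instrs):
    """OPT 2: replace 'pop dst' with 'mov @r1, dst' followed by 'add #2, r1'."""
    out = []
    for ins in instrs:
        match = POP_RE.match(ins)
        if match:
            out.append(f"mov @r1, {match.group(1)}")
            out.append("add #2, r1")
        else:
            out.append(ins)
    return out


def pad_multiplier(instrs):
    """OPT 3: insert a nop between the operand write and the result read."""
    out = []
    for i, ins in enumerate(instrs):
        out.append(ins)
        nxt = instrs[i + 1] if i + 1 < len(instrs) else ""
        if OP2_WRITE_RE.search(ins) and RESULT_READ_RE.match(nxt):
            out.append("nop")
    return out


opt2_input = ["mov &0x013a, r15", "pop r2"]
opt3_input = ["mov -6(r4), &0x0132", "mov -4(r4), &0x0138", "mov 0x013a, r15"]

print(split_pop(opt2_input))
print(pad_multiplier(opt3_input))
```

OPT 1 could be handled in the same style, but it additionally needs a free scratch register to hold the computed address, which a string-level rewrite like this one cannot choose safely on its own.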
[Figure 5.4 plot. x-axis: benchmarks (autoCorr, binSearch, FFT, intFilt,
mult, PI, tea8, tHold, div, inSort, rle, intAVG, ConvEn, Viterbi); left
axis: Peak Power Reduction %; right axis: Peak Dynamic Range Reduction %;
series: Peak Power Reduction, Peak Power Dynamic Range Reduction.]
Figure 5.4: Peak power reduction (left axis) and peak power dynamic range
reduction (right axis) achieved by optimizations. These reductions are
enabled by our analysis tool and provide further reduction in energy
harvester size.
Figure 5.4 shows the reduction in peak power achieved by applying the
optimizations motivated by our technique. Results are quantified in terms
of peak power reduction, as well as reduction in peak power dynamic range,
which quantifies the difference between peak and average power. Peak power
dynamic range decreases as peaks are reduced closer to the range of average
power. Reduction in peak power dynamic range can improve battery lifetime
in Type 2 and 3 systems, and reduction in peak power requirements can be
leveraged to reduce harvester size in Type 1 systems (see Chapter 1). Our
results show that peak power can be reduced by up to 10%, and 5% on
average. Peak power dynamic range can be reduced by up to 34%, and
18% on average. Figure 5.5 shows the peak power traces for an example
application before and after optimization, demonstrating that optimization
can reduce the peak power requirements for an application.
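The two metrics reported in Figure 5.4 can be computed from a power trace as in the sketch below. The traces are hypothetical and only mimic the effect of spreading one high-activity cycle over two cycles; the function names are illustrative.

```python
def dynamic_range(trace):
    """Peak power dynamic range: peak power minus average power."""
    return max(trace) - sum(trace) / len(trace)


def reduction_pct(before, after):
    """Percentage reduction of a metric after optimization."""
    return 100.0 * (before - after) / before


# Hypothetical per-cycle power (mW) before and after an optimization that
# spreads one high-activity cycle over two cycles (one extra cycle overall).
before = [1.0, 1.0, 2.0, 1.0]
after = [1.0, 1.0, 1.8, 1.2, 1.0]

peak_reduction = reduction_pct(max(before), max(after))
range_reduction = reduction_pct(dynamic_range(before), dynamic_range(after))
print(f"peak power reduction: {peak_reduction:.1f}%")
print(f"dynamic range reduction: {range_reduction:.1f}%")
```

Because the dynamic range is measured relative to the average, a modest cut in the peak can translate into a proportionally larger cut in the dynamic range, which matches the trend in the reported results.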
Since optimizations that reduce peak power can increase the number of
instructions executed by an application, we evaluated the performance and
energy impact of the optimizations. Figure 5.6 shows the results. Applying
the optimizations suggested by our technique degrades performance by up to
5% for one application, and by 1% on average. On average, the optimizations
increase energy by 3%. Although the optimizations increase energy slightly,
they can still enable reduction in size for Type 1 systems, in which
harvester size is dictated by peak power, and may also reduce the size of
Type 2 and 3 systems, where both peak power and energy determine the size
of energy storage and harvesting components (see Figure 1.3).
[Figure 5.5 plot. x-axis: Time [seconds] (×10^-6); y-axis: Power [mW];
series: X-based, X-based opt.]
Figure 5.5: A snapshot of instantaneous power profiles for mult before and
after optimization.
[Figure 5.6 plot. x-axis: benchmarks (autoCorr, binSearch, FFT, intFilt,
mult, PI, tea8, tHold, div, inSort, rle, intAVG, ConvEn, Viterbi); left
axis: Performance Degradation %; right axis: Energy Overhead %; series:
Performance Degradation, Energy Overhead.]
Figure 5.6: Performance degradation and energy overhead introduced by
peak power optimizations are small (averages: 1% performance degradation,
3% energy overhead).
CHAPTER 6
GENERALITY AND LIMITATIONS
We applied our techniques in the context of ULP processors that are already
the most widely-used type of processor and are also expected to power a large
number of emerging applications [29, 30, 31, 32, 33]. Such processors also
tend to be simple, run relatively simple applications, and do not support
non-determinism (no branch prediction and caching; for example, see Ta-
ble 6.1). This makes our symbolic simulation-based technique a good fit for
such processors. Below, we discuss how our technique may scale for complex
processors and applications, if necessary.
More complex processors contain more performance-enhancing features
such as large caches, prediction or speculation mechanisms, and out-of-order
execution, that introduce non-determinism into the instruction stream. Co-
analysis is capable of handling this added non-determinism at the expense
of analysis tool runtime. For example, by injecting an X as the result of a
tag check, both the cache hit and miss paths will be explored in the memory
hierarchy. Similarly, since co-analysis already explores taken and not-taken
paths for input-dependent branches, it can be adapted to handle branch pre-
diction. In an out-of-order processor, the ordering of instructions is based
on the dependence pattern between instructions. Thus, extending input-
independent CFG exploration to also explore the data flow graph (DFG)
may allow analysis of out-of-order execution.
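The role of X in exploring both outcomes of an unknown condition can be illustrated with a small three-valued logic sketch. This is generic ternary logic rather than the implementation of the co-analysis tool, and all names are illustrative. A cache hit signal is modeled as tag_match AND valid: when the tag comparison is X and the line is valid, the hit signal is X and both the hit and miss paths must be explored; when the line is invalid, the result is a known miss.

```python
X = "X"  # unknown logic value


def and3(a, b):
    """Three-valued AND: a known 0 dominates, otherwise X propagates."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def or3(a, b):
    """Three-valued OR: a known 1 dominates, otherwise X propagates."""
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return X


def not3(a):
    """Three-valued NOT."""
    return X if a == X else 1 - a


def branches_to_explore(condition):
    """An unknown condition forces exploration of both outcomes."""
    return [0, 1] if condition == X else [condition]


def may_toggle(previous, current):
    """Conservative toggle check: only known, equal values rule out a toggle."""
    return previous == X or current == X or previous != current


hit_valid_line = and3(X, 1)    # tag comparison unknown, line valid
hit_invalid_line = and3(X, 0)  # tag comparison unknown, line invalid
print(branches_to_explore(hit_valid_line))
print(branches_to_explore(hit_invalid_line))
```

Every additional X that reaches a control decision doubles the number of paths to consider at that point, which is the source of the analysis runtime cost mentioned above.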
In other application domains, there exist applications with more complex
CFGs. For more complex applications, heuristic techniques may be used
to improve scalability of hardware-software co-analysis. While heuristics
have been applied to improve scalability in other contexts (e.g.,
verification) [19, 20], heuristics for hardware-software co-analysis must
be conservative to guarantee that no gate is marked as untoggled when it
could be toggled. The development of such heuristics is the subject of
future work.
Table 6.1: Microarchitectural features in recent embedded processors.
Processor                  Branch Predictor   Cache
ARM Cortex-M0              no                 no
ARM Cortex-M3              yes                no
Atmel ATxmega128A4         no                 no
Freescale/NXP MC13224v     no                 no
Intel Quark-D1000          yes                yes
Jennic/NXP JN5169          no                 no
SiLab Si2012               no                 no
TI MSP430                  no                 no
In a multi-programmed setting (including systems that support dynamic
linking), we take the union of the toggle activities of all applications (caller,
callee, and the relevant OS code in case of dynamic linking) to get a conserva-
tive peak power value. For self-modifying code, peak power for the processor
would be chosen to be the peak of the code version with the highest peak.
In case of fine-grained multi-threading, any state that is not maintained as
part of a thread’s context is assumed to have a value of X when symbolic
execution is performed for an instruction belonging to the thread. This leads
to a safe guarantee of peak power for the thread, irrespective of the behavior
of the other threads.
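The union-based bound for the multi-programmed case can be sketched as follows, under a deliberately simplified power model in which every gate that may toggle contributes a fixed worst-case power. The gate names, power values, and per-gate model are hypothetical and stand in for the per-application toggle activity produced by co-analysis.

```python
def conservative_peak(toggle_sets, gate_power_uw):
    """Bound peak power using the union of per-application toggle sets.

    toggle_sets: iterable of sets of gate names that may toggle.
    gate_power_uw: worst-case power per gate (simplified, hypothetical model).
    """
    may_toggle = set().union(*toggle_sets)
    return sum(gate_power_uw[gate] for gate in may_toggle)


gate_power_uw = {"g1": 1, "g2": 2, "g3": 4}
caller = {"g1", "g2"}
callee = {"g2", "g3"}

print(conservative_peak([caller], gate_power_uw))
print(conservative_peak([callee], gate_power_uw))
print(conservative_peak([caller, callee], gate_power_uw))
```

The combined bound is never lower than the bound for any single application, which is what makes it safe regardless of how the applications interleave.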
Our technique naturally handles state machines that run synchronously
with the microcontroller. For state machines that run asynchronously (e.g.,
ADCs, DACs, bus controllers), we assume the worst-case power at any in-
stant by separately analyzing the asynchronous state machine to compute
peak power and energy and adding the values to those of the processor.
Asynchronous state machines are generally much smaller than the actual
processor, allowing us to not be overly conservative.
A similar approach can be used to handle interrupts: the peak power is
offset by the worst-case power consumed during interrupt detection. The effect
of an asynchronous interrupt can be characterized by forcing the interrupt
pin to always read an X. Since this can potentially cause the PC to be
updated with an X, we can force the PC update logic to ignore the interrupt
handling logic’s output. This is achieved by monitoring a particular net in
the design and forcing it to zero every time its value becomes X. Interrupt
service routines (ISRs) are regular software routines and can be analyzed
with the rest of the code.
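The two mechanisms described for asynchronous state machines and interrupts can be sketched together. The first function mimics forcing a monitored net to a known value whenever it evaluates to X; the second adds separately analyzed worst-case contributions to the processor's peak. Function names and numbers are hypothetical, and the forcing is shown on a plain list of values rather than inside a gate-level simulator.

```python
X = "X"  # unknown logic value used by symbolic simulation


def force_known(net_value, forced_value=0):
    """Return forced_value whenever the monitored net evaluates to X."""
    return forced_value if net_value == X else net_value


def conservative_system_peak(processor_peak_mw, async_peaks_mw, irq_detect_peak_mw):
    """Add separately analyzed worst-case contributions to the processor peak."""
    return processor_peak_mw + sum(async_peaks_mw) + irq_detect_peak_mw


# Hypothetical values of the monitored net over five cycles.
net_trace = [0, 1, X, 0, X]
print([force_known(value) for value in net_trace])

# Hypothetical peaks (mW): processor, two asynchronous blocks, interrupt detection.
print(conservative_system_peak(2.0, [0.25, 0.5], 0.25))
```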
CHAPTER 7
RELATED WORK
Peak power has been analyzed in several settings in the literature. In par-
ticular, several techniques have been proposed to estimate the peak power of
a design. Hsiao et al. [34, 35] propose a genetic algorithm-based estimation
of peak power for a circuit. Wang and Roy [36] use an automatic test gen-
eration technique to compute lower and upper bounds for maximum power
dissipation for a VLSI circuit. Sambamurthy et al. [37] propose a technique
that uses a bounded model checker to estimate peak dynamic power at the
module-level. The technique is also functionally valid at the processor level.
Najeeb et al. [38] propose a technique that converts a circuit behavioral model
to an integer constraint model and employs an integer constraint solver to
generate a power virus that can be used to estimate the peak power of the
processor. To the best of our knowledge, no prior work exists on determining
application-specific peak power for a processor based on symbolic simulation.
The above techniques require a low-level description of the processor (be-
havioral or gate-level). Techniques have also been proposed at the architecture-
level to predict when power exceeds the peak power budget or to lower the
peak-to-average power variation. Sartori and Kumar [39] propose the use of
DVFS techniques to manage peak power in a multi-core system. Kontorinis
et al. [28] propose a configurable core to meet peak power constraints with
minimal impact on performance. Our technique identifies the peak power and
energy requirements of a processor through hardware-software co-analysis.
Estimating peak energy of an application has been previously studied as
the worst case energy consumption (WCEC) problem [22, 40, 41]. How-
ever, prior techniques do not use accurate power models, instead relying on
microarchitectural models, which do not consider the detailed state of a pro-
cessor or input values. As observed by [42], the power of an instruction can
vary based on the previous instructions in the pipeline and its operand val-
ues. Our peak power computation technique analyzes an application on a
gate-level processor netlist, allowing us to account for the fine-grained inter-
action between instructions and the worst-case operand values. The result
is an accurate power model that can be used for WCEC analyses such as
the example analysis in Chapter 5. Prior work on worst-case timing analysis
simply identified the timing-critical path through the program. However, the
timing-critical path through a program may not be energy-critical [22, 41].
We calculate energy across all paths through gate-level simulation to deter-
mine the path with highest energy.
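The observation that the timing-critical path need not be the energy-critical path can be illustrated on a toy acyclic CFG. The block names, cycle counts, and energy values below are hypothetical constants chosen for the example; in the approach described above, per-path energy comes from gate-level simulation rather than from a fixed per-block table.

```python
def all_paths(cfg, node, exit_node):
    """Enumerate every path from node to exit_node in an acyclic CFG."""
    if node == exit_node:
        return [[node]]
    paths = []
    for successor in cfg[node]:
        for rest in all_paths(cfg, successor, exit_node):
            paths.append([node] + rest)
    return paths


def path_total(path, cost, index):
    """Sum one cost component (0 = cycles, 1 = energy) along a path."""
    return sum(cost[block][index] for block in path)


# Hypothetical CFG: one branch, two alternative blocks.
cfg = {
    "entry": ["loop_body", "mult_block"],
    "loop_body": ["exit"],
    "mult_block": ["exit"],
    "exit": [],
}
# Hypothetical (cycles, energy in pJ) per basic block.
cost = {
    "entry": (2, 2),
    "loop_body": (10, 8),
    "mult_block": (6, 12),
    "exit": (1, 1),
}

paths = all_paths(cfg, "entry", "exit")
timing_critical = max(paths, key=lambda p: path_total(p, cost, 0))
energy_critical = max(paths, key=lambda p: path_total(p, cost, 1))
print("timing-critical:", timing_critical)
print("energy-critical:", energy_critical)
```

Here the longer path runs for more cycles but the shorter path, which uses a power-hungry block, consumes more energy, so an analysis that follows only the timing-critical path would underestimate peak energy.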
Symbolic simulation has been applied in circuits for logic and timing ver-
ification, as well as sequential test generation [18, 43, 44, 45, 46] and deter-
mination of application-specific Vmin [47]. Symbolic simulation has also been
applied for software verification [48]. However, to the best of our knowledge,
no existing technique has applied symbolic simulation to determine the peak
power and energy requirements of an application running on a processor.
CHAPTER 8
CONCLUSION
In this thesis, we showed that peak power and energy requirements for an
ultra-low-power embedded processor can be application-specific as well as
input-specific. This renders profiling methods to determine the peak power
and energy of ULP processors ineffective, unless conservative guardbands are
applied, increasing system size and weight. We presented an automated tech-
nique based on symbolic simulation that determines a more aggressive peak
power and energy requirement for a ULP processor for a given application.
We showed that the application-specific peak power and energy requirements
determined by our technique are more accurate, and therefore less conser-
vative, than those determined by conventional techniques. On average, the
peak power requirements determined by our technique are 27%, 26%, and
15% lower than those generated based on design specifications, a stressmark,
and profiling, respectively. Peak energy requirements generated by our tech-
nique are 47%, 26%, and 17% lower, on average, than those generated based
on design specifications, a stressmark, and profiling, respectively. We also
showed that our technique can be used to guide optimizations that target
and reduce the peak power of a processor. Optimizations suggested by our
technique reduce peak power by up to 10% for a set of benchmarks.
REFERENCES
[1] H. Blodget, M. Ballve, T. Danova, C. Smith, J. Heggestuen, M. Hoelzel,
E. Adler, C. Weissman, H. King, N. Quah, J. Greenough, and J. Smith,
“The internet of everything: 2015,” BI Intelligence, 2014.
[2] “International Technology Roadmap for Semiconductors 2.0 2015 Edi-
tion Executive Report,” http://www.semiconductors.org.
[3] D. Evans, “The internet of things: How the next evolution of the internet
is changing everything,” April 2011, Cisco Internet Business Solutions
Group.
[4] G. Press, “Internet of Things By The Numbers: Market Estimates And
Forecasts,” Forbes, 2014.
[5] “Microcontroller Sales Regain Momentum After Slump,”
www.icinsights.com/news/bulletins/Microcontroller-Sales-Regian-
Momentum-After-Slump, Feb. 23, 2015.
[6] B. Calhoun, S. Khanna, Y. Zhang, J. Ryan, and B. Otis, “System design
principles combining sub-threshold circuit and architectures with energy
scavenging mechanisms,” in Circuits and Systems (ISCAS), Proceedings
of 2010 IEEE International Symposium on, May 2010, pp. 269–272.
[7] Texas Instruments, “eZ430-RF2500-SEH Solar Energy Harvesting De-
velopment Tool User’s Guide,” 2013.
[8] I. Buchmann, “The Secrets of Battery Runtime,” Battery University,
2016.
[9] K. Furset and P. Hoffman, “High pulse drain impact on CR2032 coin
cell battery capacity,” Nordic Semiconductor and Energizer, 2011.
[10] “Battery energy,” http://www.allaboutbatteries.com/Battery-
Energy.html, 2015.
[11] J. A. Paradiso and T. Starner, “Energy scavenging for mobile and wire-
less electronics,” IEEE Pervasive Computing, vol. 4, no. 1, pp. 18–27,
Jan 2005.
[12] “EEMBC, Embedded Microprocessor Benchmark Consortium,”
http://www.eembc.org.
[13] O. Girard, “OpenMSP430 project,” 2013, available at opencores.org.
[14] National Instruments, “Compile Faster with the LabVIEW FPGA
Compile Cloud Service,” Jun 29, 2016. [Online]. Available:
http://www.ni.com/white-paper/52328/en/
[15] Cloud Compiling, “Cloud Compiling,” Jan. 1, 2013. [Online]. Available:
http://www.cloudcompiling.com/
[16] ARM mbed, “ARM mbed IoT Device Platform,” Jun 1, 2016. [Online].
Available: https://www.mbed.com/en/
[17] J. Borgeson, “Ultra-low-power pioneers: TI slashes total MCU
power by 50 percent with new “Wolverine” MCU platform,”
2012, Texas Instruments White Paper. [Online]. Available:
http://www.ti.com/lit/wp/slay019a/slay019a.pdf
[18] R. E. Bryant, “Symbolic Simulation – Techniques and Applications,”
in Proceedings of the 27th ACM/IEEE Design Automation Conference.
ACM, 1991, pp. 517–521.
[19] C. Cadar and K. Sen, “Symbolic execution for software testing: Three
decades later,” Commun. ACM, vol. 56, no. 2, pp. 82–90, Feb. 2013.
[Online]. Available: http://doi.acm.org/10.1145/2408776.2408795
[20] K. Hamaguchi, “Symbolic simulation heuristics for high-level design de-
scriptions with uninterpreted functions,” in High-Level Design Valida-
tion and Test Workshop, 2001. Proceedings. Sixth IEEE International,
2001, pp. 25–30.
[21] Synopsys, PrimeTime User Guide. [Online]. Available:
http://www.synopsys.com/
[22] R. Jayaseelan, T. Mitra, and X. Li, “Estimating the worst-case en-
ergy consumption of embedded software,” in 12th IEEE Real-Time and
Embedded Technology and Applications Symposium (RTAS’06). IEEE,
2006, pp. 81–90.
[23] Synopsys, Design Compiler User Guide. [Online]. Available:
http://www.synopsys.com/
[24] Cadence, Encounter Digital Implementation User Guide. [Online].
Available: http://www.cadence.com/
[25] B. Zhai, S. Pant, L. Nazhandali, S. Hanson, J. Olson, A. Reeves, M. Min-
uth, R. Helfand, T. Austin, D. Sylvester et al., “Energy-efficient sub-
threshold processor design,” Very Large Scale Integration (VLSI) Sys-
tems, IEEE Transactions on, vol. 17, no. 8, pp. 1127–1137, 2009.
[26] Y. Kim, L. K. John, S. Pant, S. Manne, M. Schulte, W. L. Bircher,
and M. S. S. Govindan, “Audit: Stress testing the automatic way,”
in Proceedings of the 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO-45. Washington,
DC, USA: IEEE Computer Society, 2012, pp. 212–223. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2012.28
[27] Intel Corporation, “Intel Pentium 4 Processor in the 423-pin Package
Thermal Design Guidelines,” 2000.
[28] V. Kontorinis, A. Shayan, D. M. Tullsen, and R. Kumar, “Reducing
peak power with a table-driven adaptive processor core,” in
Proceedings of the 42nd Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM,
2009, pp. 189–200. [Online]. Available:
http://doi.acm.org/10.1145/1669112.1669137
[29] M. Magno, L. Benini, C. Spagnol, and E. Popovici, “Wearable low power
dry surface wireless sensor node for healthcare monitoring application,”
in Wireless and Mobile Computing, Networking and Communications
(WiMob), 2013 IEEE 9th International Conference on. IEEE, 2013,
pp. 189–195.
[30] R. Yu and T. Watteyne, “Reliable, Low Power Wireless Sensor
Networks for the Internet of Things: Making Wireless Sensors
as Accessible as Web Servers,” Linear Technology, 2013. [Online].
Available: http://cds.linear.com/docs/en/white-paper/wp003.pdf
[31] A. Dunkels, J. Eriksson, N. Finne, F. Osterlind, N. Tsiftes, J. Abeillé,
and M. Durvy, “Low-Power IPv6 for the internet of things,” in Net-
worked Sensing Systems (INSS), 2012 Ninth International Conference
on. IEEE, 2012, pp. 1–6.
[32] R. Tessier, D. Jasinski, A. Maheshwari, A. Natarajan, W. Xu, and
W. Burleson, “An energy-aware active smart card,” Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on, vol. 13, no. 10, pp.
1190–1199, 2005.
[33] C. Park, P. H. Chou, Y. Bai, R. Matthews, and A. Hibbs, “An ultra-
wearable, wireless, low power ECG monitoring system,” in Biomedical
Circuits and Systems Conference, 2006. BioCAS 2006. IEEE. IEEE,
2006, pp. 241–244.
[34] M. S. Hsiao, “Peak power estimation using genetic spot optimization
for large VLSI circuits,” in Proceedings of the conference on Design,
automation and test in Europe. ACM, 1999, p. 38.
[35] M. S. Hsiao, E. M. Rudnick, and J. H. Patel, “K2: an estimator for
peak sustainable power of VLSI circuits,” in Low Power Electronics and
Design, 1997. Proceedings., 1997 International Symposium on. IEEE,
pp. 178–183.
[36] C.-Y. Wang and K. Roy, “Maximum power estimation for CMOS cir-
cuits using deterministic and statistical approaches,” Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on, vol. 6, no. 1, pp.
134–140, 1998.
[37] S. Sambamurthy, S. Gurumurthy, R. Vemu, and J. A. Abraham, “Func-
tionally valid gate-level peak power estimation for processors,” in Qual-
ity of Electronic Design, 2009. ISQED 2009. Quality Electronic Design.
IEEE, 2009, pp. 753–758.
[38] K. Najeeb, V. Vardhan, R. Konda, S. Kumar, S. Hari, V. Kamakoti,
and V. M. Vedula, “Power virus generation using behavioral models of
circuits,” in VLSI Test Symposium, 2007. 25th IEEE, May 2007, pp.
35–42.
[39] J. Sartori and R. Kumar, “Distributed peak power management for
many-core architectures,” in Design, Automation Test in Europe Con-
ference Exhibition, 2009. DATE ’09., April 2009, pp. 1556–1559.
[40] P. Wägemann, T. Distler, T. Hönig, H. Janker, R. Kapitza, and
W. Schröder-Preikschat, “Worst-case energy consumption analysis for
energy-constrained embedded systems,” in 2015 27th Euromicro Con-
ference on Real-Time Systems. IEEE, 2015, pp. 105–114.
[41] K. Seth, A. Anantaraman, F. Mueller, and E. Rotenberg, “Fast:
Frequency-aware static timing analysis,” ACM Transactions on Embed-
ded Computing Systems (TECS), vol. 5, no. 1, pp. 200–224, 2006.
[42] J. Morse, S. Kerrison, and K. Eder, “On the infeasibility of analysing
worst-case dynamic energy,” arXiv preprint arXiv:1603.02580, 2016.
[43] A. Kölbl, J. Kukula, and R. Damiano, “Symbolic RTL simulation,” in
Design Automation Conference, 2001. Proceedings, 2001, pp. 47–52.
[44] T. Feng, L. C. Wang, K.-T. Cheng, M. Pandey, and M. S. Abadir,
“Enhanced symbolic simulation for efficient verification of embedded
array systems,” in Design Automation Conference, 2003. Proceedings of
the ASP-DAC 2003. Asia and South Pacific, Jan 2003, pp. 302–307.
[45] P. Jain and G. Gopalakrishnan, “Efficient symbolic simulation-based
verification using the parametric form of Boolean expressions,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Sys-
tems, vol. 13, no. 8, pp. 1005–1015, Aug 1994.
[46] L. Liu and S. Vasudevan, “Efficient validation input generation in RTL
by hybridized source code analysis,” in Design, Automation Test in Eu-
rope Conference Exhibition (DATE), 2011, March 2011, pp. 1–6.
[47] H. Cherupalli, R. Kumar, and J. Sartori, “Exploiting dynamic timing
slack for energy efficiency in ultra-low-power embedded systems,” in
Computer Architecture (ISCA), 2016 43rd Annual International
Symposium on. IEEE, 2016.
[48] Y. Zhang, Z. Chen, and J. Wang, “Speculative symbolic execution,”
in Software Reliability Engineering (ISSRE), 2012 IEEE 23rd Interna-
tional Symposium on, 2012, pp. 101–110.
