HyperPower: Power- and Memory-Constrained Hyper-Parameter Optimization
  for Neural Networks by Stamoulis, Dimitrios et al.
This paper will appear in the proceedings of DATE 2018. Pre-print version, for personal use only.
Dimitrios Stamoulis∗, Ermao Cai∗, Da-Cheng Juan†, Diana Marculescu∗
∗Department of ECE, Carnegie Mellon University, Pittsburgh, PA
†Google Research, Mountain View, CA
Emails: dstamoul@andrew.cmu.edu, ermao@cmu.edu, dacheng@google.com, dianam@cmu.edu
Abstract—While selecting the hyper-parameters of Neural Net-
works (NNs) has been so far treated as an art, the emergence
of more complex, deeper architectures poses increasingly more
challenges to designers and Machine Learning (ML) practition-
ers, especially when power and memory constraints need to be
considered. In this work, we propose HyperPower, a framework
that enables efficient Bayesian optimization and random search
in the context of power- and memory-constrained hyper-
parameter optimization for NNs running on a given hardware
platform. HyperPower is the first work (i) to show that power
consumption can be used as a low-cost, a priori known con-
straint, and (ii) to propose predictive models for the power and
memory of NNs executing on GPUs. Thanks to HyperPower, the
number of function evaluations and the best test error achieved
by a constraint-unaware method are reached up to 112.99×
and 30.12× faster, respectively, while never considering invalid
configurations. HyperPower significantly speeds up the hyper-
parameter optimization, achieving up to 57.20× more function
evaluations compared to constraint-unaware methods for a
given time interval, effectively yielding significant accuracy
improvements by up to 67.6%.
1. Introduction
Hyper-parameter optimization of Machine Learning algorithms,
and especially of Neural Networks (NNs), has emerged as an
increasingly challenging and expensive process, dubbed by
many researchers to be more of an art than a science. A
surprisingly high number of state-of-the-art methodologies
rely heavily on human experts, and this expert knowledge can
separate a useless model from cutting-edge performance [1].
Nonetheless, as the design space of hyper-parameters to be
tuned grows, the task of proper hand-tailored tuning can
become daunting [2].
Moreover, the ability of a human expert could be ham-
pered further if we are to consider platform specific test-time
power and memory constraints. We visualize the complexity
of the design space in Figure 1, by reporting testing error
and GPU power consumption for different NN variations
of the AlexNet (CIFAR-10 with Caffe [3] on Nvidia GTX-
1070). We observe that, for a given accuracy level, power
could differ significantly by up to 55.01W (i.e., more than
a third of the GPU Thermal Design Power). Hence, the
more of an art than science rationale could become non-
trivial to exploit in a hardware constrained design space,
necessitating a significant, yet often unavailable, familiarity
of the researcher with the hardware architecture. In addition,
the evaluation of each possible architecture could take hours,
if not days, to train. Thus, traditional techniques for hyper-
parameter optimization, such as grid search, yield poor
results in terms of performance and training time [2].
Bayesian optimization and random search methods
have been shown to outperform human experts in
hyper-parameter optimization for NNs [2] [4] [5]. Nevertheless,
neither method has been extensively studied in the
context of hardware-constrained hyper-parameter
optimization. On the one hand, Bayesian optimization has increased
complexity for cases where constraints are not available
a priori [6]. On the other hand, random methods with-
out hardware-aware enhancements could wastefully con-
sider infeasible points that are randomly selected. These
observations constitute the key motivation behind our work.
As a motivating example, hardware-aware hyper-parameter
optimization (based on our framework presented later on)
can find an iso-error NN with power savings of 12.12W
compared to AlexNet, or an iso-power NN with error
decreased from 24.74% to 21.16%.
Figure 1. Power differs for a given accuracy level, hampering a human
expert’s ability to identify the optimal NN configuration under hardware
constraints (similarly for memory, not shown due to space limitations).
To this end, we propose HyperPower, a framework that
enables efficient Bayesian optimization and random search
in the context of power- and memory-constrained hyper-
parameter optimization for NNs. Our work makes the fol-
lowing contributions:
1) HyperPower is the first work to show that power con-
sumption can be treated as an a priori known constraint
for NN model selection. This insight for low-cost con-
straint evaluation paves the way towards enabling effi-
cient Bayesian optimization and random search methods.
arXiv:1712.02446v1 [cs.LG] 6 Dec 2017
Figure 2. Overview of HyperPower flow and illustration of the Bayesian optimization procedure during each iteration n + 1. The ML designer only
provides the NN design space, the target platform, the power/memory budget values, and the number of iterations Nmax. Our goal is to find the NN
configuration with minimum test error under hardware constraints. Bayesian optimization uses a surrogate probabilistic model M to approximate the
objective function; the plots show the mean and confidence intervals estimated with the model (the true objective function is shown for reference, but it is
unknown in practice). At each iteration n+ 1, an acquisition function α(·) is expressed based on the model M and the maximizer of α(·) is selected as
the candidate design point xn+1 to evaluate. HyperPower incorporates the power-memory models directly into the acquisition function formulation, thus
inherently preventing sampling from constraint-violating regions (shaded red). The objective function (NN test error) is evaluated, i.e., the candidate NN
design xn+1 is trained and tested. Then, the probabilistic model M is refined via Bayesian posterior updating based on the new observation. After Nmax
iterations, HyperPower returns the design x∗ with optimal accuracy that satisfies the hardware constraints.
2) To the best of our knowledge, HyperPower is the first
work to propose predictive models for the power and
memory consumption of NNs running on GPUs.
3) HyperPower reaches the number of function evaluations
and the best test error of a constraint-unaware method up
to 112.99× and 30.12× faster, respectively, while never
considering invalid configurations.
4) HyperPower allows for up to 57.20× more function
evaluations compared to a constraint-unaware method
for a given time interval, yielding a significant accuracy
improvement by up to 67.6%.
2. Related work
Modeling hardware metrics: Prior work relies on sim-
plistic proxies of the memory consumption (e.g., counts
of the NN’s weights [6]), or on extrapolation based on
technology node energy tables per operation [7] [8] [9].
Consequently, existing modeling assumptions are either
overly simplifying and have not been compared against real
platforms, or they reflect outdated technology nodes that
are not representative of modern GPU architectures. On
the contrary, we train our models on commercial Nvidia
GPUs and we achieve accurate predictions against actual
hardware measurements. Our recent work also introduces
more elaborate (layer-wise) predictive models for runtime
and energy, which can be incorporated into HyperPower [10]
and which could be flexibly extended to account for process
variations [11], thermal effects [12], and aging [13].
Bayesian optimization under constraints: Prior art
has proposed formulations for constrained Bayesian opti-
mization that generalize the model-based treatment of the
objective to the constraint functions [6]. Hernández-Lobato
et al. have developed a general framework for employing
Bayesian optimization with unknown constraints or with
multiple objective terms [14]. This framework has been
successfully used for the co-design of hardware accelerators
and NNs [15] [16], and the design of NNs under runtime
constraints [14]. However, existing methodologies evaluate
only MNIST on hardware simulators [15] [16], do not
consider power as a key design constraint [14], and use a
simplistic count of the network’s weights as a proxy for
the memory constraint. Instead, in our work, we propose an
accurate model for both power and memory that is trained
and tested on different commercial GPUs and datasets.
HyperPower and the proposed power and memory models
can be flexibly incorporated into generic formulations that
support constrained multi-objective optimization [14].
Prior art has motivated optimization cases where the
constraints can be expressed as known a priori [6]; these
formulations enable models that can directly capture candi-
date configurations as valid or invalid [17]. In this work, we
are the first to show that both power and memory constraints
can be formulated as constraints known a priori. We exploit
this insight to train predictive models on the power and
memory consumption of NNs executing on state-of-the-art
platforms and datasets, allowing the HyperPower framework
to navigate the design space in a constraint “complying”
manner. We extend the hyper-parameter optimization models
to explicitly account for hardware imposed constraints.
Random (model-free) methods: Random [5] and
random-walk [8] hyper-parameter selection has been shown
to perform well in problems that have low effective dimen-
sionality [1]. Nevertheless, when hardware constraints are
to be accounted for, it is as likely for random methods
to sample a point inside the invalid region as to select a
candidate point outside of it. Our enhancements allow invalid
randomly selected points to be discarded quickly.
3. HyperPower Framework
The HyperPower framework is illustrated in Figure 2.
The ML practitioner selects the hyper-parameter space
(possible NN configurations), the target platform, and the
power/memory constraints. After Nmax iterations, Hyper-
Power returns the NN with optimal accuracy that satisfies
the hardware constraints. This problem of interest is a
special case of optimizing function f(x) over design space
X and constraints g(x), i.e., minx∈X f(x), s.t. g(x) ≤ c,
where the objective function (i.e., test error of each NN
configuration) has no simple closed form and its evaluations
are costly. To efficiently solve this problem, HyperPower
exploits the effectiveness of Bayesian optimization methods.
3.1. Bayesian optimization
Bayesian optimization is a sequential model-based ap-
proach that approximates the objective function with a sur-
rogate (cheaper to evaluate) probabilistic model M, based
on Gaussian processes (GP). The GP model is a probability
distribution over the possible functions f(x), and it
approximates the objective at each iteration n + 1 based
on the data $X := \{x_i \in \mathcal{X}\}_{i=1}^{n}$ queried so far. We assume that
the values $\mathbf{f} := f_{1:n}$ of the objective function at points X
are jointly Gaussian with mean m and covariance K, i.e.,
$\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{m}, K)$. This formulation intuitively encapsulates
our belief about the shape of functions that are more likely
to fit the data observed so far. Since the observations y of f are
noisy with additive noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, we write the GP
model as $\mathbf{y} \mid \mathbf{f}, \sigma^2 \sim \mathcal{N}(\mathbf{f}, \sigma^2 I)$. At each point x, the GP gives us
a cheap approximation for the mean and the uncertainty of
the objective, written as pM(y|x) and illustrated in Figure 2
with the black curve and the grey shaded areas.
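As a rough illustration of how a GP surrogate produces a mean and an uncertainty at a query point, the sketch below computes the posterior of a zero-mean GP on 1-D inputs. The squared-exponential kernel, the length scale, the noise level, and all function names are assumptions made for this example; they are not the configuration used in the experiments.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalars (assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(X, y, x_new, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    n = len(X)
    # Kernel matrix of the observed points, with noise on the diagonal.
    K = [[rbf(X[i], X[j]) + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(xi, x_new) for xi in X]
    alpha = solve(K, y)        # (K + noise_var * I)^{-1} y
    v = solve(K, k_star)       # (K + noise_var * I)^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)
```

Near an observed point the variance collapses towards the noise level, while far from all observations it returns to the prior variance; this is the behavior depicted by the shaded areas in Figure 2.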
Each iteration n+1 of a Bayesian optimization algorithm
consists of three key steps:
1 Maximization of acquisition function: We first need
to select the point xn+1 (i.e., next candidate NN config-
uration) at which the objective (i.e., the test error of the
candidate NN) will be evaluated next. This task of guiding
the search relies on the so-called acquisition function α(x).
A popular choice for the acquisition function is the
Expected Improvement (EI) criterion, which computes the
expected amount by which the objective function f will improve
upon (i.e., fall below) some threshold y+, i.e.,
$\mathrm{EI}(x) = \int_{-\infty}^{\infty} \max\{y^+ - y, 0\} \cdot p_{\mathcal{M}}(y \mid x)\, dy$.
Intuitively, α(x) provides a measure
of the direction toward which there is an expectation of
improvement of the objective function.
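When the predictive density at x is Gaussian with mean μ and standard deviation σ, the EI integral above has the well-known closed form (y+ − μ)·Φ(z) + σ·φ(z) with z = (y+ − μ)/σ. The following minimal sketch evaluates it for a minimization objective; the function and argument names are chosen for illustration only.

```python
import math

def expected_improvement(mu, sigma, y_best):
    """E[max(y_best - y, 0)] for y ~ N(mu, sigma^2), i.e. EI for minimization."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (y_best - mu) * cdf + sigma * pdf
```

For a fixed predicted mean, the value grows with σ, which is how the criterion rewards exploration of uncertain regions.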
The acquisition function is evaluated at different candi-
date points x, yielding high values at points where the GP’s
uncertainty is high (i.e., favoring exploration), and where the
GP predicts a good objective value, i.e., a low test error (favoring exploitation) [1];
this is qualitatively illustrated in Figure 2 (blue curve).
We select the maximizer of α(x) as the point xn+1 to
evaluate next (green triangle in Figure 2). To enable power-
and memory-aware Bayesian optimization, HyperPower in-
corporates hardware-awareness directly into the acquisition
function (subsection 3.4).
2 Evaluation of the objective: Once the current candi-
date NN design xn+1 has been selected, the NN is generated
and trained to completion to acquire the test error. This is
the most expensive step. Hence, our efforts towards enabling
efficient Bayesian optimization, mainly focus on detecting
when this step can be bypassed (subsection 3.2).
3 Probabilistic model update: As the new objective
value yn+1 becomes available at the end of iteration n+1,
the probabilistic model pM(y) is refined via Bayesian
posterior updating (the posterior mean $m_{n+1}(x)$ and
covariance $K_{n+1}$ can be analytically derived). This step is
quantitatively illustrated in Figure 2 with the black curve
and the grey shaded areas. Please note how the updated
model has reduced uncertainty around the previous samples
and newly observed point. For an overview of GP models
the reader is referred to [1].
Figure 3. Visualizing our insights: how power varies vs accuracy with the
number of training epochs (left); how accuracy can indicate configurations
that do not converge to high-accuracy values (> 10%) (right).
3.2. HyperPower enhancements
Early termination of the NN training at step 2 : First,
we observe that candidate architectures that diverge during
training can be identified quickly, after only a few training
epochs (Figure 3 (right)). Please note that this is different
from predicting the final test error of a network, which
could suffer from overestimation issues [18], introducing
artifacts to the probabilistic model. Instead of predicting for
converging cases, we identify diverging cases, allowing the
optimization process to discard low-performance samples.
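The early-termination idea can be sketched as a check on the accuracy reached after a few epochs. The number of epochs to wait (`check_after`) and the accuracy threshold (`min_accuracy`, here chance level for a 10-class dataset) are assumptions for this example, as is the `train_one_epoch` callback that returns the accuracy after each epoch.

```python
def train_with_early_termination(train_one_epoch, max_epochs,
                                 check_after=3, min_accuracy=0.10):
    """Train epoch by epoch and stop early if the run looks diverged.

    Returns (last_accuracy, epochs_run, terminated_early).
    """
    accuracy = 0.0
    for epoch in range(1, max_epochs + 1):
        accuracy = train_one_epoch(epoch)
        # Single check after a few epochs: still at or below the threshold?
        if epoch == check_after and accuracy <= min_accuracy:
            return accuracy, epoch, True
    return accuracy, max_epochs, False
```

Note that the check only flags runs that fail to leave the low-accuracy region; it does not attempt to extrapolate the final test error of converging runs.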
Power and memory as low-cost constraints: To enable
an efficient formulation with a-priori known constraints, we
observe that power and memory are low-cost constraints
to evaluate. That is, as motivated in prior art for run-
time [6] [15], the power-memory characteristics of an NN
are not affected by the quality of the trained model itself.
In Figure 3 (left), we observe that the NN power values
on Nvidia TX1 with MNIST do not heavily change even
if the NN is trained for more iterations (unlike accuracy,
obviously). We are the first to exploit this insight to train
predictive models for the power and memory of NN architectures.
More importantly, we use the predictive models to formulate
a power- and memory-constrained acquisition function.
3.3. Power and memory models
To enable a priori power and memory constraint evaluations
that are decoupled from the expensive objective evaluation,
we propose to model the power and memory consumption
of a network as a function of the J discrete (structural)
hyper-parameters $z \in \mathbb{Z}_+^J$ (a subset of $x \in \mathcal{X}$); we train on
the structural hyper-parameters z that affect the NN’s power
and memory (e.g., number of hidden units), since parameters
such as the learning rate have negligible impact.
To this end, we employ offline random sampling by
generating different configurations based on the ranges of
the considered hyper-parameters z (discussed in Section 4).
Since the Bayesian optimization corresponds to function
evaluations with respect to the test error [4], for each
candidate design $z_l$ we measure the hardware platform’s
power $P_l$ and memory $M_l$ values during inference and not
during the NN’s training. Given the L profiled data points
$\{(z_l, P_l, M_l)\}_{l=1}^{L}$, we train the following models that are
linear with respect to both the input vector $z \in \mathbb{Z}_+^J$ and
the model weights $w, m \in \mathbb{R}^J$, i.e.:

Power model: $P(z) = \sum_{j=1}^{J} w_j \cdot z_j$    (1)

Memory model: $M(z) = \sum_{j=1}^{J} m_j \cdot z_j$    (2)
We train the models above by employing 10-fold cross
validation on the dataset $\{(z_l, P_l, M_l)\}_{l=1}^{L}$. While we
experimented with nonlinear regression formulations which can
be plugged into the models (e.g., see our recent work [10]),
these linear functions provide sufficient accuracy (as shown
in Section 5). More importantly, we select the linear form
since it allows for the efficient evaluation of the power and
memory predictions within the acquisition function (next
subsection), computed on each sampled grid point of the
hyper-parameter space.
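A minimal sketch of fitting such a linear model is given below. It assumes ordinary least squares via the normal equations, which the text does not specify, and it omits the 10-fold cross validation; `fit_linear`, `predict`, and `solve` are illustrative names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear(Z, targets):
    """Least-squares weights for targets ~ sum_j w_j * z_j (no intercept term)."""
    J = len(Z[0])
    # Normal equations: (Z^T Z) w = Z^T targets.
    gram = [[sum(row[a] * row[b] for row in Z) for b in range(J)] for a in range(J)]
    rhs = [sum(row[a] * t for row, t in zip(Z, targets)) for a in range(J)]
    return solve(gram, rhs)

def predict(w, z):
    """Evaluate the linear model at structural hyper-parameters z."""
    return sum(wj * zj for wj, zj in zip(w, z))
```

Because a prediction is a single dot product, evaluating the model for every candidate point considered by the acquisition function adds negligible cost compared to training a network.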
3.4. Constraint-aware acquisition function
HW-IECI: In the context of hardware-constrained
optimization, EI allows us to directly incorporate the a priori
constraint information in a representative way. Inspired by
constraint-aware heuristics [6] [17], we propose a power and
memory constraint-aware acquisition function:

$a(x) = \int_{-\infty}^{\infty} \max\{y^+ - y, 0\} \cdot p_{\mathcal{M}}(y \mid x) \cdot \mathbb{I}[P(z) \le PB] \cdot \mathbb{I}[M(z) \le MB]\, dy$    (3)

where z are the structural hyper-parameters and $p_{\mathcal{M}}(y \mid x)$ is
the predictive marginal density of the objective function
at x based on the surrogate model $\mathcal{M}$. $\mathbb{I}[P(z) \le PB]$ and
$\mathbb{I}[M(z) \le MB]$ are indicator functions, which are equal
to 1 if the power budget PB and the memory budget MB
are respectively satisfied, and 0 otherwise. Typically, the
threshold y+ is adaptively set to the best (i.e., lowest) value
$y^+ = \min_{i=1:n} y_i$ over previous observations [1] [6].
Intuitively, we capture the fact that improvement should
not be possible in regions where the constraints are violated.
Inspired by the integrated expected conditional improve-
ment (IECI) [17] formulation, we refer to this proposed
methodology as HW-IECI. We leave the systematic explo-
ration of other acquisition functions for future work. Note
that uncertainty can be also encapsulated by replacing the
indicator functions with probabilistic Gaussian models as
in [17], whose implementation is already supported by the
used tool [4] and whose analysis we leave for future work.
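Because the indicator functions in Eq. (3) do not depend on y, they factor out of the integral, so the acquisition value is simply the EI value when both budgets are met and zero otherwise. The sketch below illustrates this under the assumption of a Gaussian predictive density; the function names and the weight vectors passed in are hypothetical and not taken from the experiments.

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization with y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (y_best - mu) * cdf + sigma * pdf

def linear_model(weights, z):
    """Linear power or memory prediction, as in Eqs. (1)-(2)."""
    return sum(w * v for w, v in zip(weights, z))

def hw_ieci(mu, sigma, y_best, z, w_power, w_mem, power_budget, mem_budget):
    """EI gated by the power and memory indicator functions."""
    if linear_model(w_power, z) > power_budget:
        return 0.0  # predicted power violates the budget
    if linear_model(w_mem, z) > mem_budget:
        return 0.0  # predicted memory violates the budget
    return expected_improvement(mu, sigma, y_best)
```

Since the maximizer of the acquisition function is chosen as the next sample, points predicted to violate a budget are not selected as long as at least one feasible candidate has positive EI.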
3.5. Alternative methods supported by HyperPower
Constraints as Gaussian Processes – HW-CWEI: We
also consider the case where the constraints are modeled
by GPs [6] using a latent function $\hat{g}(x) = g_0 - g(x)$ per
constraint. Each GP models the probability of the constraint
being satisfied, $\Pr(C(x)) = \Pr(g(x) \le g_0)$. In the context
of our approach, to enable efficient constraint evaluation,
we evaluate the latent functions based on our models:
Pr(M(z) ≤ MB) and Pr(P(z) ≤ PB). Inspired by the
Constraint Weighted EI (CWEI) [6] function, we refer to
this second methodology as HW-CWEI.
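One way to picture the constraint-weighted variant is to multiply EI by the probability that each budget is met. The sketch below assumes each constraint value is Gaussian around the model prediction with a user-supplied standard deviation; how that uncertainty is obtained is not specified in the text, so `power_std` and `mem_std` are purely illustrative inputs.

```python
import math

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def prob_satisfied(pred, budget, std):
    """Pr(g <= budget) when g ~ N(pred, std^2)."""
    if std <= 0.0:
        # Degenerate case: reduces to an indicator function.
        return 1.0 if pred <= budget else 0.0
    return normal_cdf((budget - pred) / std)

def hw_cwei(ei, power_pred, mem_pred, power_budget, mem_budget,
            power_std, mem_std):
    """EI weighted by the probabilities that both budgets are satisfied."""
    return (ei
            * prob_satisfied(power_pred, power_budget, power_std)
            * prob_satisfied(mem_pred, mem_budget, mem_std))
```

With zero standard deviations the weights become hard indicators, which recovers the gating behavior of the indicator-based formulation.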
Random search – Rand: We consider random search
as the popular model-free alternative [5]. Once again, we
exploit the insight of power modeling and early termination,
by replacing the GP-based selection with random
selection. We denote this method as Rand.
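A minimal sketch of random search with model-based rejection follows. The `sample`, `is_feasible`, and `objective` callbacks are placeholders: in practice `is_feasible` would evaluate the power and memory models against the budgets, and `objective` would train and test the candidate network. The `max_draws` cap is an added safeguard for this example.

```python
import random

def constrained_random_search(sample, is_feasible, objective, n_evals,
                              rng=None, max_draws=100000):
    """Random search that only evaluates points predicted to be feasible.

    Returns (best_point, best_value, number_of_rejected_draws).
    """
    rng = rng or random.Random()
    best_x, best_y = None, float("inf")
    rejected = evaluated = draws = 0
    while evaluated < n_evals and draws < max_draws:
        x = sample(rng)
        draws += 1
        if not is_feasible(x):
            rejected += 1      # cheap model check, no training needed
            continue
        y = objective(x)       # expensive step: train and test the NN
        evaluated += 1
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y, rejected
```

Rejected draws cost only a model evaluation, so the evaluation budget is spent entirely on configurations predicted to satisfy the constraints.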
Random walk – Rand-Walk: Random walk methods,
denoted here as Rand-Walk, aim to “tame” the randomness
by tuning the exploitation-exploration trade-off [8]; the next
random point $x_{n+1}$ is selected around the point $x^+$ with
the best objective value $y^+$ over previous observations.
Formally, at each step we select from within a “neighborhood”
controlled by $\sigma_0^2$, i.e., $x_{n+1} \sim \mathcal{N}(x^+, \sigma_0^2)$.
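The neighborhood sampling step can be sketched as follows. Clipping to the hyper-parameter ranges and using one shared standard deviation for all dimensions are simplifying assumptions of this example; discrete hyper-parameters would additionally need rounding.

```python
import random

def random_walk_step(x_best, sigma0, bounds, rng):
    """Draw the next point around x_best, one Gaussian draw per dimension."""
    return [min(max(rng.gauss(xb, sigma0), lo), hi)
            for xb, (lo, hi) in zip(x_best, bounds)]
```

A small `sigma0` keeps the search close to the incumbent (exploitation), while a large value approaches plain random search within the bounds (exploration), which is why the method is sensitive to this setting.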
In Section 5, we show that by exploiting our insights, the
HyperPower implementations of these methods, i.e., Rand,
Rand-Walk, HW-CWEI, and HW-IECI, significantly
outperform their default, previously published hardware-unaware
counterparts [5] [8] [6] [17], in the context of power- and
memory-constrained hyper-parameter optimization.
4. Experimental Setup
We employ power- and memory-constrained optimiza-
tion with the four discussed methods, on two different
machines, i.e., a server machine with an NVIDIA GTX 1070
and a low-power embedded board NVIDIA Tegra TX1. To
train the predictive models, we profile the CNNs offline
using Caffe [3] on both the CIFAR-10 and MNIST datasets.
We extend and implement the aforementioned four methods
on top of Spearmint [4]. We implement wrapper
scripts around the objective/constraint functions that are
queried by Spearmint; these scripts automate the generation of
Caffe simulations and the power/memory model evaluations.
We employ hyper-parameter optimization on variants of the
AlexNet network for MNIST and CIFAR-10, with six and
thirteen hyper-parameters respectively. For the convolution
layers we vary the number of features (20-80) and the kernel
size (2-5), for the pooling layers the kernel size (1-3), and for
the fully connected layers the number of units (200-700). We
also vary the learning rate (0.001-0.1), the momentum (0.8-
0.95), and the weight decay (0.0001-0.01) values. While the
considered experimental setup serves as a comprehensive
basis to evaluate HyperPower, we are currently considering
larger networks on the state-of-the-art ImageNet dataset as
part of future work.
5. Experimental Results
Proposed power and memory models: First, we assess
the accuracy of the power and memory models (Equations 1-
2). In Figure 5, we plot the predicted and actual power
values, trained on the MNIST and CIFAR-10 networks for
both GTX 1070 and Tegra TX1. Alignment across the blue
line indicates good prediction results. We observe good
prediction for all tested platforms and datasets, with a Root
Mean Square Percentage Error (RMSPE) value always less
than 7% (Table 1) for both power and memory models. It
is worth noticing the power value ranges per device and
that our proposed models can accurately capture both the
high-performance and low-power design regimes.
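For reference, the sketch below computes RMSPE using its common definition, the square root of the mean squared relative error expressed as a percentage. The text does not spell out the formula, so this definition is an assumption.

```python
import math

def rmspe(actual, predicted):
    """Root Mean Square Percentage Error, in percent."""
    terms = [((p - a) / a) ** 2 for a, p in zip(actual, predicted)]
    return 100.0 * math.sqrt(sum(terms) / len(terms))
```

Because each error is normalized by the measured value, the metric is comparable across devices with very different power ranges.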
Fixed number of function evaluations: We first assess
the four methods for a fixed number of function evaluations.
We apply each algorithm on the MNIST and CIFAR-10 NNs
with power constraints of 90W and 85W, respectively. As
typical values for the experiments [6] [4] [18], we select a
maximum number of 50 iterations per run (30 for MNIST);
we execute each method five times, and we optimize for the
test error and report it per iteration.
Figure 4. Assessment of the four methods on hyper-parameter optimization on CIFAR-10 CNN. (left) Best observed test error against the number of
function evaluations. (center) Number of constraint-violating samples against the number of function evaluations. (right) Test error per function evaluation.
Figure 5. Actual and predicted power using our models for MNIST and
CIFAR-10, executing on GTX 1070 (left) and Tegra TX1 (right).
TABLE 1. ROOT MEAN SQUARE PERCENTAGE ERROR (RMSPE) VALUES OF THE PROPOSED POWER AND MEMORY MODELS.
Proposed model   MNIST, GTX 1070   CIFAR-10, GTX 1070   MNIST, Tegra TX1   CIFAR-10, Tegra TX1
Power            5.70%             5.98%                6.62%              4.17%
Memory           4.43%             4.67%                –¹                 –¹
For the CIFAR-10 case, we plot the results in Figure 4
and we make the following observations. We confirm in
Figure 4 (center) that HW-IECI does not select samples that
violate the constraints. This is significantly beneficial, since
it allows HW-IECI to reach the region around the average
best error in a fifth of the function evaluations, as shown in
Figure 4 (left). Finally, the Bayesian optimization methods
outperform both random (model-free) methods. That is, in
Figure 4 (right), it is easy to observe that most points queried
by Bayesian optimization (red circles and blue squares) are
in high-performance regions, while random methods tend to
select points in low-performance regions.
Efficient hyper-parameter optimization via power
modeling and early termination (fixed runtime): Next,
we evaluate the hardware-constrained hyper-parameter op-
timization under maximum wall-clock runtime budget; this
scenario is important in a more commercial-standard context
when executing on a cluster [1] and under pricing schemes
in Infrastructure as a Service systems, where speeding up
the expensive function evaluation addresses not only practi-
cal but also financial limitations related to hyper-parameter
optimization [2], [18]. We therefore repeat the exploration
for three runs per method for each considered device-dataset
pair with the following constraints: 85W
and 1.15GB for MNIST on GTX 1070, 90W and 1.25GB
for CIFAR-10 on GTX 1070, 10W for MNIST on Tegra
TX1, and 12W for CIFAR-10 on Tegra TX1 (no memory
constraints on Tegra¹). To impose upper runtime constraints,
each method keeps querying new samples as long as the total
wall-clock time is less than two hours and five hours
for MNIST and CIFAR-10, respectively; please note that
we allow the last sample queried right before the maximum
time limit to complete (as seen in Table 3, where the average
runtime is slightly above the two- and five-hour spans).
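The runtime-budget protocol described above can be sketched as a loop that checks the elapsed time only before starting a new sample, so the sample in flight at the deadline still completes. The `query_next` callback and the injectable `clock` are illustrative assumptions.

```python
import time

def run_with_time_budget(query_next, budget_seconds, clock=time.monotonic):
    """Query samples until the wall-clock budget is exhausted.

    Returns (list_of_results, total_elapsed_time).
    """
    start = clock()
    results = []
    # The budget is checked before each query, never during one.
    while clock() - start < budget_seconds:
        results.append(query_next())
    return results, clock() - start
```

This explains why the total runtime can slightly exceed the nominal budget: the last query starts just before the limit and is allowed to finish.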
1. Tegra does not support NVML API for memory measurements, and
the tegrastats command reports utilization and not memory consump-
tion; for representative comparison, we do not consider memory on Tegra.
Figure 6. Capturing the benefit of using early termination and the
power/memory models to all four considered methods; best test error on
CIFAR-10 NN against the total hyper-parameter optimization runtime.
First, we visualize in Figure 6 the benefit that the
power/memory models and the early termination offer in
HyperPower. For CIFAR-10 on GTX 1070, we repeat the
5-hour execution for each method in an exhaustive manner,
where these two enhancements are deactivated (the exhaus-
tive versions are shown with dotted lines in Figure 6). We
observe that all four methods reach a high-performance
region faster than the default (exhaustive) methods, which
can be seen with all solid lines lying to the left of the dotted
ones. Second, we observe a higher density of samples along
the solid lines; this is to be expected, since low-performance
or violating samples can be quickly discarded.
We present the results for all considered methods and
all device-dataset pairs in Tables 2-5, where we compare
against the constraint-unaware implementations of those
methods (these exhaustive cases are denoted as default,
and the average speedup values are computed as the geomet-
ric mean across all runs per case). First, in Table 2 we report
the mean and the standard deviation of the best test error
achieved by each method. As expected, the constraint-aware
versions of all four methods supported by HyperPower
outperform their respective constraint-unaware counterparts,
with an accuracy increase of up to 67.6% for the case of Rand
on CIFAR-10 with Tegra TX1.
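Since the average speedups are geometric means over runs, a small sketch of that computation may help when reading the tables. The per-run pairing of default and HyperPower runtimes is an assumption of this example, and the numbers in the note below are hypothetical.

```python
import math

def geometric_mean_speedup(default_hours, hyperpower_hours):
    """Geometric mean of the per-run speedups default / HyperPower."""
    ratios = [d / h for d, h in zip(default_hours, hyperpower_hours)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For two hypothetical runs with (default, HyperPower) runtimes of (2.0, 1.0) and (2.0, 0.25) hours, the per-run speedups are 2× and 8×, giving a geometric mean of 4×, whereas the ratio of the averaged runtimes is 3.2×. A reported speedup therefore need not equal the ratio of the two mean columns in the same table row.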
It is important to note that HyperPower results dis-
play less variance compared to all the constraint-unaware
versions. For the random-based methods, this is because
several runs completely failed to find a high-performance
region. This is to be expected since the default exhaustive
methodologies are agnostic to constraints, hence they could
keep wastefully sampling constraint-violating designs. An
extreme case of this inefficiency could be seen for both
CIFAR-10 cases solved with Rand-Walk, which both failed
to reach a feasible solution. This highlights the key disad-
vantage of Random Walk methods due to the sensitivity of
their performance to the selection of the proper σ0 value,
which defeats the purpose of automated hyper-parameter
optimization altogether. These observations for vanilla random
search methods [5] [8] have significant implications, since a
total of 32 hours of server runtime was wasted.
TABLE 2. MEAN BEST TEST ERROR (AND STANDARD DEVIATION IN PARENTHESES) ACHIEVED PER METHOD.
           MNIST – GTX 1070                  CIFAR-10 – GTX 1070               MNIST – Tegra TX1                 CIFAR-10 – Tegra TX1
Solver     Default          HyperPower       Default          HyperPower       Default          HyperPower       Default          HyperPower
Rand       60.59% (41.46%)  1.01% (0.18%)    69.60% (28.85%)  24.39% (3.08%)   1.06% (0.08%)    0.97% (0.14%)    74.35% (22.13%)  24.09% (1.97%)
Rand-Walk  31.16% (41.55%)  0.84% (0.08%)    –                22.88% (0.87%)   1.04% (0.05%)    0.90% (0.12%)    –                21.90% (0.59%)
HW-CWEI    0.97% (0.19%)    0.85% (0.12%)    22.09% (1.01%)   22.09% (0.35%)   0.98% (0.13%)    0.91% (0.07%)    24.28% (0.60%)   22.99% (0.41%)
HW-IECI    0.81% (0.05%)    0.81% (0.02%)    22.35% (1.71%)   21.81% (0.38%)   0.81% (0.02%)    0.79% (0.03%)    23.35% (1.04%)   21.95% (0.65%)
TABLE 3. RUNTIME (HOURS) FOR HyperPower METHODS TO REACH THE NUMBER OF SAMPLES THAT THEIR EXHAUSTIVE COUNTERPARTS QUERIED.
           MNIST – GTX 1070               CIFAR-10 – GTX 1070            MNIST – Tegra TX1              CIFAR-10 – Tegra TX1
Solver     Default  HyperPower  Speedup   Default  HyperPower  Speedup   Default  HyperPower  Speedup   Default  HyperPower  Speedup
Rand       2.14     0.02        101.46×   5.25     0.22        30.31×    2.08     0.49        4.31×     5.35     0.74        11.78×
Rand-Walk  2.17     0.02        112.99×   5.29     0.46        17.45×    2.14     1.00        2.15×     5.31     1.13        21.00×
HW-CWEI    2.00     0.30        10.22×    5.06     2.46        2.07×     2.16     1.32        1.65×     5.39     1.10        8.06×
HW-IECI    2.02     1.81        1.13×     5.12     2.97        1.74×     2.02     1.65        1.22×     5.22     1.71        3.48×
TABLE 4. INCREASE IN THE NUMBER OF SAMPLES THAT EACH METHOD WAS ABLE TO QUERY.
           MNIST – GTX 1070               CIFAR-10 – GTX 1070            MNIST – Tegra TX1              CIFAR-10 – Tegra TX1
Solver     Default  HyperPower  Increase  Default  HyperPower  Increase  Default  HyperPower  Increase  Default  HyperPower  Increase
Rand       14.00    796.33      57.20×    14.67    405.33      27.88×    13.00    35.67       2.77×     13.33    262.33      20.00×
Rand-Walk  15.00    316.67      19.16×    13.33    118.33      8.86×     14.00    30.67       2.12×     14.33    88.67       5.46×
HW-CWEI    21.67    62.67       2.79×     28.00    38.67       1.38×     11.00    14.67       1.35×     13.00    27.33       1.97×
HW-IECI    53.00    60.33       1.14×     29.00    43.33       1.49×     46.33    54.67       1.18×     11.00    20.00       1.75×
TABLE 5. IMPROVEMENT IN RUNTIME (HOURS) TO ACHIEVE THE BEST ACCURACY THAT THE EXHAUSTIVE METHODS DID.
           MNIST – GTX 1070               CIFAR-10 – GTX 1070            MNIST – Tegra TX1              CIFAR-10 – Tegra TX1
Solver     Default  HyperPower  Speedup   Default  HyperPower  Speedup   Default  HyperPower  Speedup   Default  HyperPower  Speedup
Rand       0.50     0.16        1.56×     2.09     0.53        3.97×     0.72     0.16        3.64×     2.63     0.48        4.54×
Rand-Walk  0.69     0.12        4.72×     –        –           –         1.74     0.27        6.18×     –        –           –
HW-CWEI    3.47     3.06        6.11×     4.12     2.58        2.08×     1.30     0.19        7.39×     5.05     1.24        4.80×
HW-IECI    1.61     0.10        30.12×    4.45     2.49        2.13×     1.53     0.14        11.30×    3.24     1.16        2.69×
Moreover, among all four versions supported by Hy-
perPower, we can observe that HW-IECI always achieves
the best results, which shows the importance of enabling a-
priori constraint evaluation through our predictive models.
Finally, we observe that HyperPower’s enhancements allow
for up to 57.20× more function evaluations (see Table 4).
In terms of runtime improvement, HyperPower is up to
112.99× faster in reaching the same number of function
evaluations that the default methods queried (see Table 3),
which attests to the importance of hardware-awareness when
power/memory violating samples can be quickly discarded.
Most importantly, thanks to HyperPower, we can reach the
best test error achieved by the exhaustive methods up to
30.12× faster (see Table 5).
6. Conclusion
Accounting for power and memory constraints could
significantly impede the effectiveness of traditional hyper-
parameter optimization methods to identify optimal NN
configurations. In this work, we proposed HyperPower,
a framework that enables efficient hardware-constrained
Bayesian optimization and random search. We showed that
power consumption can be used as a low-cost, a priori
known constraint, and we proposed predictive models for the
power and memory of NNs executing on GPUs. Thanks to
HyperPower, we reached the number of function evaluations
and the best test error achieved by a constraint-unaware
method up to 112.99× and 30.12× faster, respectively, while
never considering invalid configurations under the proposed
HW-IECI acquisition function. By significantly speeding up
the hyper-parameter optimization with up to 57.20× more function
evaluations compared to constraint-unaware methods for a
given time interval, HyperPower yielded significant accu-
racy improvements by up to 67.6%.
Acknowledgments
This research was supported in part by NSF CNS Grant
No. 1564022.
References
[1] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas,
“Taking the human out of the loop: A review of bayesian optimiza-
tion,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[2] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task bayesian op-
timization,” in Advances in neural information processing systems,
2013, pp. 2004–2012.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[4] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian op-
timization of machine learning algorithms,” in Advances in neural
information processing systems, 2012, pp. 2951–2959.
[5] J. Bergstra and Y. Bengio, “Random search for hyper-parameter
optimization,” Journal of Machine Learning Research, vol. 13, no.
Feb, pp. 281–305, 2012.
[6] M. A. Gelbart, J. Snoek, and R. P. Adams, “Bayesian optimization
with unknown constraints,” arXiv preprint arXiv:1403.5607, 2014.
[7] B. D. Rouhani, A. Mirhoseini, and F. Koushanfar, “Delight: Adding
energy dimension to deep neural networks,” in Proceedings of the
2016 International Symposium on Low Power Electronics and Design.
ACM, 2016, pp. 112–117.
[8] S. C. Smithson, G. Yang, W. J. Gross, and B. H. Meyer, “Neural
networks designing neural networks: Multi-objective hyper-parameter
optimization,” in Computer-Aided Design (ICCAD), 2016 IEEE/ACM
International Conference on. IEEE, 2016, pp. 1–8.
[9] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and
connections for efficient neural network,” in Advances in Neural
Information Processing Systems, 2015, pp. 1135–1143.
[10] E. Cai, D.-C. Juan, D. Stamoulis, and D. Marculescu, “Neuralpower:
Predict and deploy energy-efficient convolutional neural networks,”
arXiv preprint arXiv:1710.05420, 2017.
[11] D. Stamoulis and D. Marculescu, “Can we guarantee performance
requirements under workload and process variations?” in Proceedings
of the 2016 International Symposium on Low Power Electronics and
Design. ACM, 2016, pp. 308–313.
[12] E. Cai, D. Stamoulis, and D. Marculescu, “Exploring aging deceler-
ation in finfet-based multi-core systems,” in Computer-Aided Design
(ICCAD), 2016 IEEE/ACM International Conference on. IEEE,
2016, pp. 1–8.
[13] D. Stamoulis, D. Rodopoulos, B. H. Meyer, D. Soudris, F. Catthoor,
and Z. Zilic, “Efficient reliability analysis of processor datapath using
atomistic bti variability models,” in Proceedings of the 25th edition
on Great Lakes Symposium on VLSI. ACM, 2015, pp. 57–62.
[14] J. M. Hernández-Lobato, M. A. Gelbart, R. P. Adams, M. W. Hoffman,
and Z. Ghahramani, “A general framework for constrained
bayesian optimization using information-based search,” 2016.
[15] J. M. Hernández-Lobato, M. A. Gelbart, B. Reagen, R. Adolf,
D. Hernández-Lobato, P. N. Whatmough, D. Brooks, G.-Y. Wei, and
R. P. Adams, “Designing neural network hardware accelerators with
decoupled objective evaluations,” in NIPS workshop on Bayesian
Optimization, 2016, p. 10.
[16] B. Reagen, J. M. Hernández-Lobato, R. Adolf, M. Gelbart,
P. Whatmough, G.-Y. Wei, and D. Brooks, “A case for efficient accelerator
design space exploration via bayesian optimization,” in Low Power
Electronics and Design (ISLPED), 2017 IEEE/ACM International
Symposium on. IEEE, 2017, pp. 1–6.
[17] R. B. Gramacy and H. K. Lee, “Optimization under unknown con-
straints,” arXiv preprint arXiv:1004.4027, 2010.
[18] T. Domhan, J. T. Springenberg, and F. Hutter, “Speeding up automatic
hyperparameter optimization of deep neural networks by extrapolation
of learning curves.” in IJCAI, 2015, pp. 3460–3468.
