Optimizing performance per watt on GPUs in High Performance Computing:
  temperature, frequency and voltage effects by Price, D. C. et al.
Optimizing performance-per-watt on GPUs in High
Performance Computing
Temperature, frequency and voltage effects
D. C. Price · M. A. Clark · B. R. Barsdell · R. Babich · L. J. Greenhill
Abstract The magnitude of the real-time digital sig-
nal processing challenge attached to large radio astro-
nomical antenna arrays motivates use of high perfor-
mance computing (HPC) systems. The need for high
power efficiency at remote observatory sites parallels
that in HPC broadly, where efficiency is a critical met-
ric. We investigate how the performance-per-watt of
graphics processing units (GPUs) is affected by tem-
perature, core clock frequency and voltage. Our results
highlight how the underlying physical processes that
govern transistor operation affect power efficiency. In
particular, we show experimentally that GPU power
consumption increases non-linearly (quadratically) with both
temperature and supply voltage, as predicted by physical
transistor models. We show that lowering GPU supply
voltage and increasing clock frequency while maintain-
ing a low die temperature increases the power efficiency
of an NVIDIA K20 GPU by up to 37-48% over default
settings when running xGPU, a compute-bound code
used in radio astronomy. We discuss how automatic
temperature-aware and application-dependent voltage
and frequency scaling (T-DVFS and A-DVFS) may pro-
vide a mechanism to achieve better power efficiency for
a wider range of compute codes running on GPUs.
Keywords performance per watt · power efficiency ·
radio astronomy · HPC · GPU · DVFS
The authors acknowledge support from NSF grants PHYS-
080357, AST-1106059, and OIA-1120587. BB thanks the
NVIDIA internship program for support.
D. C. Price, B. R. Barsdell, L. J. Greenhill
Harvard-Smithsonian Center for Astrophysics, MS 42, 60
Garden Street, Cambridge MA 02138 USA
E-mail: dprice@cfa.harvard.edu
M. A. Clark, R. Babich
NVIDIA, 2701 San Tomas Expy, Santa Clara, CA 95050 USA
1 Introduction
Power efficiency is a crucial design factor within HPC.
Power consumption is often a limiting factor for HPC
systems, with current generation machines already requiring
power budgets >1 MW to operate¹.
In order to build Exascale systems ($> 10^{18}$ floating
point operations per second, i.e. FLOPS), increasing
the achieved performance-per-watt of HPC hardware
is of paramount importance for several reasons. First
and foremost, the more power consumed, the more
the system costs to operate. Second, generation and
distribution of power is non-trivial on megawatt scales.
In addition, waste heat poses an engineering challenge:
it must be removed to avoid compute nodes overheating
and failing. HPC cooling systems often require signif-
icant amounts of power themselves. Decreasing com-
pute power consumption in turn decreases infrastruc-
ture power consumption and as such is the most promis-
ing way to increase overall power efficiency.
Machines based upon graphics processing units (GPUs)
dominate the Green 500 list, with 9 of the top 10 machines
featuring GPUs. Indeed, two of the top 10 most
powerful computers on the June 2014 Top 500 list², Titan
(2nd) and Piz Daint (6th), utilize Kepler GPUs. As
such, the question of how best to increase GPU power
efficiency is pressing.
In this paper, we investigate how performance-per-
watt can be optimized for an NVIDIA K20 GPU. We
approach the problem by considering the physical pro-
cesses that govern transistor performance; in particular,
how temperature, supply voltage, and clock frequency
affect power efficiency.
1 http://www.green500.org/lists/green201411
2 http://www.top500.org/lists/2014/11/
arXiv:1407.8116v3 [astro-ph.IM] 20 Oct 2015
1.1 Power efficiency and GPUs in radio astronomy
The proposed all-sky imaging element of the Long Wavelength
Array (LWA) [4], with $\sim 0.1$ km$^2$ collecting area,
and the proposed Square Kilometre Array (SKA) telescope³
will demand peta- and exa-scale processing,
respectively (e.g. [1]). Due to the number of
computations required, power efficiency is of particular
concern for these and other next-generation radio tele-
scopes. Constrained operational budgets further dictate
that strict power usage targets must be met. In the past,
custom hardware has been built to perform the required
digital signal processing tasks. For example, the VLA
WIDAR and ALMA correlators [19, 22] are very capa-
ble and power efficient signal processing systems, but
they lack the flexibility afforded by architectures where
processing is performed on general purpose computing
platforms; they also took over a decade to design and
implement. It has been shown that GPU-based signal
processing systems for radio astronomy can be designed
and deployed in a fraction of this time, see for example
[10].
GPUs are well-suited to many of the signal pro-
cessing tasks required in radio astronomy. If power effi-
ciency challenges can be met, then a GPU-based HPC
system would be an attractive solution for SKA signal
processing. A GPU-based implementation of the SKA1-
Low central signal processor (a subsystem of the full
SKA) is calculated to require ∼335 kW, based on cur-
rent NVIDIA Kepler GPU architecture [15]. This as-
sumes a GPU power efficiency of 12 GFLOPS/W; we
report 18.3 GFLOPS/W for the xGPU cross-correlation
code, after temperature-aware tuning of GPU supply
voltage and frequency. Whether or not GPUs are con-
sidered for SKA1-Low signal processing hardware will
depend upon demonstrating that GPUs can achieve ac-
ceptable power efficiency within the next few years.
1.2 Power leakage
The physical processes that underlie power usage are
common across all architectures. These processes can be
broadly broken into two categories: static and dynamic.
That is, total power usage Psys is given by
$$P_{\rm sys} = P_{\rm static} + P_{\rm dynamic}. \qquad (1)$$
The dynamic power is the power consumed in switching
logic states, given for a single logic component by
$$P_{\rm dynamic} = C V_{dd}^2 f_{\rm clock},$$
where $C$ is the load capacitance, $V_{dd}$ is the voltage
swing and $f_{\rm clock}$ is the switching frequency. For a chip
with many logic components, the dynamic power is the
sum of the contributions of all $N_c$ components:
$$P_{\rm dynamic} = \sum_{n=1}^{N_c} C_n V_n^2 f_n, \qquad (2)$$
which for devices with a single clock domain (i.e. switching
frequency), voltage swing $V_{dd}$, and identical logic
components simplifies to $N_c C V_{dd}^2 f_{\rm clock}$.
3 http://www.skatelescope.org
Static power, also known as leakage power, is con-
sumed regardless of transistor switching and is due to
current leakage; more detailed discussion of these mech-
anisms can be found in [13, 14]. For sub-micrometer
processes (i.e. most current-generation compute archi-
tectures), subthreshold leakage is the dominant mech-
anism.
It is informative to consider an analytical expression
for subthreshold leakage. As shown in [14], Isub of a
MOS device can be expressed as
$$I_{\rm sub} = A_s \frac{W}{L} \left(\frac{kT}{q}\right)^2 e^{\frac{q(V_{gs} - V_{thr})}{nkT}}, \qquad (3)$$
where $A_s$ is a technology-dependent constant, $W$ and
$L$ are the device's effective channel width and length, $V_{gs}$
is the gate-to-source voltage, and $n$ is the transistor's subthreshold
swing coefficient. The quantity $kT/q$ is the
thermal voltage, where $k$ is Boltzmann's constant, $q$ is
the electron charge, and $T$ is temperature. The threshold
voltage $V_{thr}$ is also a (non-linear) function of temperature,
decreasing with increasing temperature.
Eq. 3 predicts that subthreshold leakage current exhibits
a non-linear temperature dependence of the form
$I_{\rm sub} \propto T^2 e^{-b/T}$, where $b > 0$. Here, the exponent
is necessarily negative, as $(V_{gs} - V_{thr}) < 0$ (by definition
of subthreshold), $q/k \approx 11605$ K/V, and $n \geq 1$; it
follows that Eq. 3 monotonically increases with temperature.
This implies that power efficiency of a transistor
increases with decreasing temperature. One therefore
expects to see performance-per-watt of GPUs improve
as die temperature is lowered.
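The monotonic increase with temperature can be checked with a single derivative. The following is a sketch that treats $b$ as a temperature-independent constant, which is an approximation, since the threshold voltage itself varies with temperature:

```latex
\frac{d}{dT}\left(T^2 e^{-b/T}\right)
  = 2T e^{-b/T} + T^2 e^{-b/T}\,\frac{b}{T^2}
  = (2T + b)\, e^{-b/T} > 0
  \quad \text{for } T > 0,\ b > 0.
```

Under that assumption the leakage term rises with temperature everywhere in the physical range, so any reduction in die temperature reduces static power.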
1.3 Maximizing power efficiency
Maximizing power efficiency ($\eta_{\rm pow}$) requires simultaneous
optimization of power consumption and computational
performance. In tension with Eqs. 1-3, compute
performance ($N_{\rm OPS}$) increases linearly with clock frequency.
That is, maximum power efficiency is given by
$$\eta_{\rm pow} = \frac{N_{\rm OPS}}{P_{\rm total}} = \frac{N_{\rm OPS}}{P_{\rm dynamic} + P_{\rm static}}. \qquad (4)$$
For a simple chip with full utilization of $N_c$ identical
compute components, each performing one operation
per clock cycle, $N_{\rm OPS} = N_c f_{\rm clock}$, where $f_{\rm clock}$ is the
clock frequency, and
$$\eta_{\rm pow} = \frac{N_c f_{\rm clock}}{N_c C V_{dd}^2 f_{\rm clock} + P_{\rm static}}. \qquad (5)$$
Eq. 5 shows that power efficiency is increased when volt-
age is decreased. Due to the Pstatic term in the denom-
inator, efficiency also increases with clock frequency.
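Both trends in Eq. 5 can be seen numerically. The sketch below evaluates the equation directly; the component count, capacitance and static power are hypothetical values chosen only for illustration and are not measurements of any GPU.

```python
def power_efficiency(n_c, capacitance_f, v_dd, f_clock_hz, p_static_w):
    """Evaluate Eq. 5: operations per second per watt for a simple chip."""
    ops_per_second = n_c * f_clock_hz  # one operation per component per cycle
    p_dynamic_w = n_c * capacitance_f * v_dd**2 * f_clock_hz
    return ops_per_second / (p_dynamic_w + p_static_w)


# Hypothetical values, for illustration only.
N_C = 1000             # number of identical compute components
CAPACITANCE_F = 1e-10  # load capacitance per component (farads)
P_STATIC_W = 20.0      # static (leakage) power (watts)

baseline = power_efficiency(N_C, CAPACITANCE_F, 1.0, 700e6, P_STATIC_W)
undervolted = power_efficiency(N_C, CAPACITANCE_F, 0.9, 700e6, P_STATIC_W)
overclocked = power_efficiency(N_C, CAPACITANCE_F, 1.0, 900e6, P_STATIC_W)

print(f"baseline:    {baseline:.3e} ops/s/W")
print(f"undervolted: {undervolted:.3e} ops/s/W")
print(f"overclocked: {overclocked:.3e} ops/s/W")
```

With the static power set to zero the frequency cancels and the efficiency reduces to $1/(C V_{dd}^2)$, which is why the gain from overclocking depends on the static term.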
For a complex chip such as a GPU this formalism
is a simplification. Another consideration is that fre-
quency and voltage are generally scaled together, not
separately. This is primarily because the time taken for a
digital circuit to switch states from low to high (the
gate delay time $t_{\rm delay}$) is
$$t_{\rm delay} \propto \frac{V_{dd} T^{\mu}}{(V_{dd} - V_{thr})^{\xi}}, \qquad (6)$$
where $\xi$ and $\mu$ are technology-dependent constants [13].
The temperature dependence of Eq. 6 arises as temper-
ature affects carrier mobility and threshold voltage. At
higher frequencies, there is more dynamic power usage,
so die temperature will in turn increase, forcing higher
Vdd to maintain suitable tdelay (which in turn increases
power usage and die temperature).
Nonetheless, the default clock-voltage combination
has been shown to be conservative on some GPUs (see
[11, 16]).
1.4 GPU power measurement and modeling
The simplified power efficiency formula presented in
Eq. 5 is not immediately applicable to GPUs, which
feature hierarchical memory, different clock domains,
multiple instructions, and dynamic control of voltage
and clock frequency (DVFS). As such, there have been
many analyses at higher abstraction levels that quantify
the power characteristics of GPU hardware and provide
models that predict power usage [3, 5, 6, 8, 9, 16, 17,
20, 21]. Our work differs in that we consider temper-
ature, voltage and frequency as independent variables
over which to optimize performance-per-watt. That is,
we consider power efficiency $\eta_{\rm pow} = \eta_{\rm pow}(V_{dd}, f_{\rm clock}, T)$.
While voltage and frequency have previously been
explored in GPU DVFS studies [5, 16, 18], we explore
a larger parameter space. Apart from Hong et al.
[6], temperature effects on GPU power efficiency have
been ignored. This is detrimental to GPU power model
accuracy and to achieving optimal power efficiency, as
discussed in Liao et al. [12, 13]. We show that the simplified
linear model of Hong et al. [6] is not sufficient
for predicting power usage on current generation GPUs.
To the authors' knowledge, this is the first time the non-linear
effect of temperature upon GPU efficiency has
been studied in public literature.
The remainder of this paper is organized as follows.
In Section 2, we introduce the hardware and software
used to find optimal power efficiency on an NVIDIA
K20 GPU. Our results are then presented in Section 3;
this is followed by discussion (Section 4) and conclu-
sions (Section 5).
2 Materials and Methods
2.1 Hardware overview
The work presented here was conducted on “GreenGPU”,
a custom-built computer system. GreenGPU consists
of a Gigabyte GA-Z68MX motherboard with an Intel
i7-2600 CPU, 16 GiB of DDR3 RAM, and an NVIDIA
Tesla K20 GPU. The default heatsink of the K20 was re-
placed with an EK-FCTK20 water block, and a Swiftech
water cooling system (MCP655) was installed. Water
cooling was added to give access to a wider range of
temperatures than possible using air cooling and to
provide control of coolant flow. The operating system
used for testing was 64-bit Linux Ubuntu 12.04 LTS,
with NVIDIA GPU driver version 319.37 installed. A
Windows 7 partition was also installed in order to run
Windows-only GPU firmware modification tools.
2.2 Clock and voltage management
To control the clock frequency and voltage of the K20
GPU, we used three tools: nvidia-smi⁴, GPU-Z⁵ and Kepler
BIOS Tweaker⁶. The nvidia-smi utility, or NVIDIA System
Management Interface, is a command line utility
that allows for the GPU core frequency to be altered;
the allowed values are dependent upon the GPU (Table 1).
The nvidia-smi tool also allows for power draw
and GPU die temperature to be read from GPU sensors,
giving an accurate way to measure temperature
and power, with differences in power and temperature
readings reliable to within ±1 W and ±1 °C. The reported
power is the full-board power consumption, which includes
memory and voltage regulators.
For finer grain control over core voltage and fre-
quency, and so that we could tune these as indepen-
dent parameters, we used the GPU-Z tool v0.7.7 and
Kepler BIOS Tweaker tool v1.27. GPU-Z is a utility that
displays GPU specifications and operating parameters,
and allows for GPU firmware to be downloaded from
the GPU. The Kepler BIOS Tweaker tool allows for modification
of the parameters within GPU firmware, such
as voltage and clock frequency. While benchmarking
was run on the Ubuntu partition of GreenGPU, these
two programs were run on the Windows 7 partition.
Note that flashing firmware using tools such as Kepler
BIOS Tweaker will void warranty and can potentially cause
damage to the GPU.
4 https://developer.nvidia.com/nvidia-system-management-interface
5 http://www.techpowerup.com/gpuz/
6 http://www.softpedia.com/get/System/Benchmarks/Kepler-BIOS-Tweaker.shtml
Table 1: Supported K20 core and memory clock pairs
GDDR5 Freq.  Core Freq.  GPU State  Core Voltage
(MHz)        (MHz)       ID         (mV)
2600         758         V5         987.5-1112.5
2600         705         V4*        950-1062.5
2600         666         V3         925-1050
2600         640         V2         912.5-1025
2600         614         V1         900-1000
324          324         V0         875-875
*Default value
2.3 xGPU cross-correlation code
For benchmarking and power efficiency testing, we used
the xGPU CUDA library⁷ [2]. xGPU computes the cross-correlation
of time-series data of $N$ inputs and is used
for interferometric synthesis imaging in radio astronomy,
see for example [10]. It is virtually identical to the
BLAS routine CHERK (Complex Hermitian Rank K
update), where the $T \times N$ matrix, corresponding to
time series data ($T$ dimension) from $N$ antennas, is multiplied
by its conjugate transpose, producing an $N \times N$
Hermitian matrix. The problem is compute-bound because
the compute complexity scales as $N^2 T$, whereas
the memory traffic scales as $N(T + N)$. Results presented
here used values $N = 8192$ and $T = 1000$. The
xGPU code differs from cuBLAS CHERK as it contains
domain-specific tweaks: it is designed to process 8-bit
integer input (processed as 32-bit floating point), only
stores the lower triangle of the correlation matrix, and
uses smaller tiles to improve performance for small-N .
xGPU also has an additional parameter corresponding to
the number of frequency channels to process; the prob-
lem is trivially parallelizable over frequency channels,
so this can be thought of as a batching parameter.
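The operation and its scaling can be illustrated with a small pure-Python sketch. This is not the xGPU implementation. The function names are hypothetical, the conjugation convention (conjugating the second input of each pair) is an assumption, and the ratio function compares only the quoted scalings, not exact operation or byte counts.

```python
def cross_correlate(samples):
    """Return the lower triangle of C[i][j] = sum_t x[t][i] * conj(x[t][j]).

    `samples` is a list of T time steps, each a list of N complex values.
    Only entries with j <= i are stored, since the matrix is Hermitian.
    """
    n_inputs = len(samples[0])
    lower = {}
    for i in range(n_inputs):
        for j in range(i + 1):
            lower[(i, j)] = sum(row[i] * row[j].conjugate() for row in samples)
    return lower


def compute_to_memory_ratio(n_inputs, n_times):
    """Ratio of the N^2 T compute scaling to the N (T + N) memory scaling."""
    return (n_inputs**2 * n_times) / (n_inputs * (n_times + n_inputs))


# Two time steps (T = 2) from two inputs (N = 2).
demo = cross_correlate([[1 + 1j, 2 + 0j], [0j, 1j]])
print(demo)
print(compute_to_memory_ratio(8192, 1000))
```

For the values used here, $N = 8192$ and $T = 1000$, the ratio is of order several hundred, which is consistent with the statement that the problem is compute-bound.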
Here, we use the xGPU application because that is
our domain of interest; however, we note that given it
is compute-bound, it is well suited to our investigation:
when running this algorithm, most of the power is con-
sumed by the floating point units, and this increases
the validity of the simple model in Section 1.2.
7 https://github.com/GPU-correlators/xGPU
xGPU has two different modes that were of particular
use for this work. The first mode is a benchmark, which
computes various performance metrics achieved, such
as FLOPS, for a given set of compile-time parameters.
The output of the GPU code is also compared against
CPU code for validation. The second mode is a power
diagnostic loop, in which xGPU is fed dummy data and
run in an infinite loop, so as to keep the GPU running
continuously.
For the compile-time parameters used, the single-precision
computational performance $p$ in GFLOPS for
xGPU was found to follow $p = 2.89 f_{\rm MHz}$, for clock frequencies
between 614-1070 MHz, with a maximum performance
of 3094 GFLOPS at 1070 MHz. Note that
temperature and voltage do not affect achieved FLOPS.
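The linear relation can be wrapped in a small helper for converting a clock frequency and a measured power draw into a power efficiency. This is a minimal sketch: the function names are hypothetical, the slope is taken to be in GFLOPS per MHz, and the fit is only assumed valid over the quoted frequency range.

```python
GFLOPS_PER_MHZ = 2.89              # fitted slope quoted in the text
F_MIN_MHZ, F_MAX_MHZ = 614, 1070   # range over which the fit was reported


def xgpu_gflops(f_mhz):
    """Predicted single-precision xGPU throughput in GFLOPS at f_mhz."""
    if not F_MIN_MHZ <= f_mhz <= F_MAX_MHZ:
        raise ValueError("clock frequency outside the fitted range")
    return GFLOPS_PER_MHZ * f_mhz


def gflops_per_watt(f_mhz, power_w):
    """Power efficiency for a measured board power draw in watts."""
    return xgpu_gflops(f_mhz) / power_w


print(xgpu_gflops(1070))             # close to the measured 3094 GFLOPS
print(gflops_per_watt(1070, 222.2))  # using the 1070 MHz power from Table 2
```

The fitted value at 1070 MHz differs slightly from the measured benchmark, as expected for a rounded slope.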
2.4 Performance profiling method
The main parameters used for testing power efficiency
in this work were GPU die temperature, GPU core volt-
age and clock frequency. We used Kepler BIOS Tweaker
and nvidia-smi to modify the GPU core voltage and
clock frequency, then we used xGPU to benchmark perfor-
mance. Attempts to vary the memory clock frequency
resulted in the GPU being inoperable, so no memory
clock adjustments were conducted.
Thermal control of the GPU die was achieved by
running xGPU in a power loop, while controlling the flow
of water through the water cooling system. In order
to continuously monitor the temperature and power
draw, we used a Python script to parse the output of
nvidia-smi and to log timestamped power usage and
temperature data to file every second. By running this
script in tandem with xGPU, we tested the performance
of the K20 GPU over a variety of core frequency and
voltage combinations.
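The logging script itself is not reproduced in this paper. The following is a minimal sketch of how such a logger might look, assuming an nvidia-smi version that supports the `--query-gpu` interface with the `power.draw` and `temperature.gpu` fields; the function names and log format are hypothetical.

```python
import subprocess
import time
from datetime import datetime

# Ask nvidia-smi for power (W) and die temperature (deg C) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one 'power, temperature' CSV line into (watts, deg C)."""
    power_str, temp_str = (field.strip() for field in line.split(","))
    return float(power_str), float(temp_str)


def read_sample():
    """Query the first GPU once and return (watts, deg C)."""
    output = subprocess.check_output(QUERY, text=True)
    return parse_sample(output.splitlines()[0])


def log_samples(path, interval_s=1.0, n_samples=None):
    """Append timestamped samples to `path`; runs until stopped if n_samples is None."""
    count = 0
    with open(path, "a") as log:
        while n_samples is None or count < n_samples:
            power_w, temp_c = read_sample()
            log.write(f"{datetime.now().isoformat()},{power_w},{temp_c}\n")
            log.flush()
            count += 1
            time.sleep(interval_s)
```

Calling `log_samples("gpu_log.csv")` alongside the xGPU power loop would then produce one line per second, which can be binned by temperature afterwards.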
3 Results
3.1 Overclocking at constant temperature
After profiling the computational performance of xGPU,
we compared power usage of the GPU at different (fclock,
Vdd) combinations. As shown in Table 1, the K20 has
preset frequency-voltage combinations that can be se-
lected with nvidia-smi. We applied frequency offsets of
0-300 MHz to these default values, in 60 MHz incre-
ments, and then measured the resulting power usage for
the xGPU code (Fig. 1a), and the corresponding power
efficiency (Fig. 1b). Voltage states are labelled V1-V5,
with increasing voltage; for these data, the firmware
voltage table was not modified. To account for temperature
effects, we held GPU die temperature at 34 ± 2 °C.
[Figure 1, two panels, each plotted against Frequency (MHz) for voltage states V1-V5: (a) Measured GPU power usage, Power (W); (b) Measured GPU power efficiency, GFLOPS/W.]
Fig. 1: GPU power usage and efficiency for xGPU code running on a K20 GPU, for default voltages (V1-V5) with
frequency offsets of 0-300 MHz over default fclock settings (see Table 1).
The default voltage state (V4) with default fclock of
705 MHz yields a power efficiency of 13.6 GFLOPS/W.
We find a peak power efficiency of 16.0 GFLOPS/W
when using the lowest voltage state with an fclock of
914 MHz, an increase of 18%. The worst power efficiency
was achieved when using the highest voltage level
with its default fclock of 758 MHz. The dip in power usage
at fclock = 705 MHz when in the V2 state is likely
due to the GPU selecting a low core voltage within the
allowed range (see Table 1).
3.2 Temperature dependence of power efficiency
Eq. 3 predicts that subthreshold leakage current is proportional
to $T^2 e^{-b/T}$. To investigate this, we compared
power efficiency of the GPU at various die temperatures,
where we have averaged multiple data into bins of
±1 °C (Fig. 2). The clock frequency was set to 705, 805
and 905 MHz, with the default core voltage state (950-1062.5 mV).
At all temperatures, power usage increases
by a fixed $\sim 0.14$ W per MHz of clock frequency. As clock frequency does not
affect static power $P_{\rm static}$, the offset between lines corresponds
to the dynamic power $P_{\rm dynamic}$ component of
the total power usage.
We also see a non-linear increase of power consumption
as a function of temperature; the simple linear
model as presented in [6] is not sufficient. If we take
into account $P_{\rm dynamic}$ through a frequency-dependent offset $c$, we can fit a model,
$P_{\rm static} = a T^2 e^{-b/T} + c$, to all three runs (solid lines). For temperature
in Kelvin, a least-squares fit yields $a = 1.00 \pm 0.23$,
$b = 3209.7 \pm 83.7$, $c = (148.9 \pm 0.2, 162.7 \pm 0.2, 176.9 \pm 0.2)$
for 705, 805 and 905 MHz, respectively.
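The fitted model can be evaluated directly to see the size of the temperature-dependent term. The sketch below uses the central fit values quoted above and ignores their uncertainties; it assumes the fit returns power in watts and reads the offset $c$ as the temperature-independent part of the board power. The function names are hypothetical.

```python
import math

A_FIT = 1.00     # fitted amplitude (central value)
B_FIT = 3209.7   # fitted exponent scale in kelvin (central value)
C_FIT = {705: 148.9, 805: 162.7, 905: 176.9}  # offset in watts per clock (MHz)


def leakage_term_w(temp_c):
    """Temperature-dependent term a * T^2 * exp(-b / T), with T in kelvin."""
    temp_k = temp_c + 273.15
    return A_FIT * temp_k**2 * math.exp(-B_FIT / temp_k)


def modelled_power_w(f_mhz, temp_c):
    """Modelled power for one of the three fitted clock frequencies."""
    return C_FIT[f_mhz] + leakage_term_w(temp_c)


for temp_c in (30, 50, 70, 90):
    print(temp_c, round(leakage_term_w(temp_c), 1), round(modelled_power_w(705, temp_c), 1))

# Offset change per MHz between the 705 and 805 MHz runs.
slope_w_per_mhz = (C_FIT[805] - C_FIT[705]) / 100
print(slope_w_per_mhz)
```

With these values the temperature-dependent term grows from about 2 W at 30 °C to about 19 W at 90 °C, and the offsets differ by close to the 0.14 W/MHz quoted above.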
Power efficiency is improved as clock frequency is
increased. There is an 18% difference in power efficiency
between worst (705 MHz at 90 °C) and best (905 MHz
at 30 °C) cases. At constant $T = 30$ °C, the power efficiency
at 905 MHz is 14.6 GFLOPS/W, as opposed to 13.5
GFLOPS/W at 705 MHz, an 8.1% increase.
3.3 Constant frequency, modified voltage
The default voltage states of the K20 are not fixed volt-
ages, but rather a range (Table 1). To investigate the
effect of voltage on power efficiency, we reprogrammed
the K20’s firmware so that the GPU core voltages V1-
V5 were fixed to the lower bound of the default volt-
age ranges (Table 1). Power efficiency as a function
of temperature for the modified voltage levels V1-V5
is shown in Fig. 3, for a constant clock frequency of
fclock = 800 MHz. As voltage is increased, power efficiency
decreases. The highest efficiency of 14.7 GFLOPS/W
was achieved using the V1 state, while the
V5 state yielded 12.6 GFLOPS/W, the lowest for these
tests. This corresponds to a 16.7% difference in power
efficiency between best and worst cases.
Apparent in Fig. 3 are unexpected discontinuous
jumps in the reported power usage. These drops are
repeatable and occur at different temperatures for dif-
ferent voltage states. We are uncertain as to the cause;
however, nvidia-smi does not report clock throttling
and no decrease in performance is seen. The altered
voltage table (as written in the GPU’s firmware) did
not allow for different voltage states, and the K20 GPU
does not employ temperature-dependent voltage scal-
ing. We conclude that this is due to an unknown off-
chip (i.e. off-processor) effect. A possible explanation
is that this is due to current-dependent efficiencies in
power delivery of the regulators that supply the GPU
die with power.
[Figure 2: power efficiency (GFLOPS/W) against Temperature (°C) for clock frequencies of 705, 805 and 905 MHz.]
Fig. 2: xGPU efficiency on a K20 GPU, with default core
voltage (V4). Power usage shows a strong non-linear
temperature dependence; this decreases performance
per watt as temperature increases. Here, T = 30 °C.
3.4 Tuning voltage and frequency
The best performance-per-watt is achieved when un-
dervolting and overclocking the GPU, as predicted in
Section 1.3 (Table 2). At 900 mV, xGPU code execution
failed and the GPU froze when attempting to run
the code at 1005 MHz. At 955 MHz, the code ran successfully
but the output failed verification when GPU
temperature was above 70 °C; that is, it did not match
the output of reference CPU code. No numerical errors
were found for temperatures below 70 °C. At 875 mV,
we achieved a maximum clock frequency of 905 MHz,
but again found that GPU output did not pass verification
for temperatures above 70 °C.
For $(V, f_{\rm clock}, T)$ = (875 mV, 905 MHz, 30 °C), we
achieved 18.3 GFLOPS/W for the xGPU code. For comparison,
the K20 default of $(V, f_{\rm clock}, T)$ = (950-1062.5
mV, 705 MHz, 30 °C) yields 13.5 GFLOPS/W for the
same code, degrading to 12.4 GFLOPS/W at 90 °C.
This means that by controlling GPU temperature, voltage,
and clock frequency, we are able to increase performance-per-watt
by 37-48% over default settings.
4 Discussion
Our results show that temperature has a nontrivial impact
on GPU power efficiency. This is primarily due to
leakage current, which scales in proportion to $T^2 e^{-b/T}$.
Optimal power efficiency is achieved with the lowest possible
GPU supply voltage and the highest possible clock
frequency at low temperature. We find efficiency can
be increased by as much as 48% on an NVIDIA K20
through this technique.
[Figure 3: power efficiency (GFLOPS/W) against Temperature (°C) for voltage states V1-V5.]
Fig. 3: xGPU power usage and efficiency on a K20 GPU,
with a core frequency of 800 MHz and varying voltage
levels (see Table 1). Here, T = 30 °C.
We have demonstrated a ∼30 W decrease from a GPU
power consumption of 154 W by changing GPU core
voltage state, a 20% reduction (Table 2). Coupled with
an increase in GPU clock frequency, performance increased
from 2064 GFLOPS to 2636 GFLOPS (a 28%
increase) while power usage simultaneously dropped
by 10 W. For large installations, even a small change in
power efficiency can have significant cost benefits.
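The percentages above follow from the Table 2 entries for the default state at 705 MHz and the 875 mV state at 705 and 905 MHz. A short sketch of the arithmetic, with a hypothetical helper name:

```python
def percent_change(old, new):
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old


# Table 2 values: power in watts, benchmark in GFLOPS.
power_saving = percent_change(153.9, 122.1)  # 705 MHz: default vs 875 mV
perf_gain = percent_change(2064, 2636)       # 705 MHz vs 905 MHz
power_delta_w = 144.4 - 153.9                # 905 MHz at 875 mV vs default 705 MHz

print(round(power_saving, 1), round(perf_gain, 1), round(power_delta_w, 1))
```

The rounded results are consistent with the approximate 20% power reduction, 28% performance increase and 10 W drop quoted in the text.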
We have presented results from a single GPU. In
actuality, the same chips from within the same pro-
cess will have a distribution of values (dynamic power,
leakage power, etc.), so a degree of conservatism is re-
quired in setting device parameters for mass produc-
tion. Allowing clock frequency and core voltage to be
set at run-time by the user, or adjusted automatically
using dynamic voltage and frequency scaling techniques
(DVFS), may provide a mechanism with which to boost
power efficiency over conservative defaults.
4.1 Application and temperature-aware DVFS
When a manufacturer chooses clock frequencies for a
GPU, it is typical to choose clock frequencies that can
support a wide range of workloads. For example, codes
such as DGEMM (double-precision general dense matrix
multiply) consume more power than the single-precision
xGPU code, but must still run within the TDP at de-
fault clock frequency. It follows that there will always
be a significant boost in clock frequencies possible for
applications that do not run close to the TDP limit.
One could imagine a control system that automatically
adjusts clock frequency and voltage, depending upon
application and desired performance optimization (e.g.
FLOPS/W or FLOPS). This would be a form of DVFS.
Such an application-dependent frequency and voltage
scaling system (A-DVFS) could offer a way to automatically
boost power efficiency and performance of
codes. This approach could also accommodate applications
that require perfect load balancing or reduced system
jitter by setting clock frequencies uniformly across
all devices used by the application.
Table 2: Power efficiency (at 50 °C) and benchmarks for
xGPU code as a function of voltage and clock frequency.
Freq.   Voltage   Power   Benchmark   Power eff.
(MHz)   (mV)      (W)     (GFLOPS)    (GFLOPS/W)
1070    987.5     222.2   3094        13.9
1017    950.0     197.1   2940        14.9
978     925.0     186.9   2826        15.1
952     912.5     175.6   2750        15.7
926     987.5     199.0   2674        13.4
926     950.0     183.1   2674        14.6
926     925.0     179.4   2674        14.9
926     912.5     172.0   2674        15.5
926     900.0     168.5   2674        15.9
905     default   181.6   2636        14.5
905     875.0     144.4   2636        18.1 (a)
805     default   167.5   2330        13.9
805     875.0     132.6   2330        17.5
705     default   153.9   2064        13.4
705     875.0     122.1   2064        16.9
(a) Output from the GPU does not pass validation for T > 70 °C.
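As a sketch of what the selection step of an A-DVFS system might look like, the code below picks an operating point from a table of measured (frequency, voltage, power, performance) values according to a chosen objective and an optional power cap. No such system is implemented in this work. The class and function names are hypothetical, and the data are a subset of the xGPU measurements in Table 2.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    freq_mhz: int
    voltage_mv: float
    power_w: float
    gflops: float

    @property
    def gflops_per_watt(self) -> float:
        return self.gflops / self.power_w


# Subset of the Table 2 measurements for xGPU (fixed-voltage rows only).
MEASURED_POINTS = [
    OperatingPoint(1070, 987.5, 222.2, 3094),
    OperatingPoint(952, 912.5, 175.6, 2750),
    OperatingPoint(926, 900.0, 168.5, 2674),
    OperatingPoint(905, 875.0, 144.4, 2636),
    OperatingPoint(705, 875.0, 122.1, 2064),
]


def select_point(points, objective="gflops_per_watt", power_cap_w=None):
    """Return the point maximizing `objective`, optionally under a power cap."""
    allowed = [p for p in points if power_cap_w is None or p.power_w <= power_cap_w]
    if not allowed:
        raise ValueError("no operating point satisfies the power cap")
    return max(allowed, key=lambda p: getattr(p, objective))


print(select_point(MEASURED_POINTS))                      # best FLOPS/W
print(select_point(MEASURED_POINTS, objective="gflops"))  # best raw FLOPS
print(select_point(MEASURED_POINTS, "gflops", power_cap_w=180.0))
```

A real system would need a table of this kind per application, since the power drawn at a given clock and voltage depends on the code being run.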
Indeed, the NVIDIA GPU Boost feature⁸, launched
with the K40 series GPU, allows users to select from
two preset higher clocks through nvidia-smi, boosting
performance for codes that run below TDP. GPU Boost
is implemented differently on the GeForce-class gaming
cards: core frequency is scaled to maintain card power
consumption close to TDP. Adding similar dynamic fre-
quency scaling functionality to server-class GPUs may
increase both power efficiency and performance for codes
with low power consumption.
Temperature and TDP limit the range of clock fre-
quencies and supply voltage combinations. The default
settings for GPUs are chosen specifically to ensure that
neither temperature nor TDP tolerances are exceeded
for any application. In contrast, best power efficiency
occurs when voltages are lowered and clock frequency
raised in accordance with operating temperature. A
hypothetical temperature-aware voltage and frequency
scaling system (T-DVFS ) could raise and lower core
voltages automatically, based on the GPU die tempera-
ture. If cooling systems maintained lower temperatures,
the T-DVFS system would accordingly lower voltage,
increasing power efficiency.
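A minimal sketch of the decision logic such a T-DVFS system might use is given below. It is purely illustrative: the two states are taken from the measurements in Section 3.4, where output at 875 mV and 905 MHz passed verification only below 70 °C, while the guard band, the names and the two-state policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuState:
    label: str
    freq_mhz: int


TUNED = GpuState("875 mV (undervolted)", 905)
DEFAULT = GpuState("V4 default (950-1062.5 mV)", 705)

VALIDATION_LIMIT_C = 70.0  # above this, the tuned state failed verification


def choose_state(die_temp_c, guard_band_c=5.0):
    """Use the tuned state only when safely below the validation limit."""
    if die_temp_c < VALIDATION_LIMIT_C - guard_band_c:
        return TUNED
    return DEFAULT


for temp_c in (30, 60, 67, 80):
    print(temp_c, choose_state(temp_c))
```

A practical controller would also need hysteresis to avoid switching rapidly between states near the threshold, and the limit would have to be characterized per device.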
8 http://www.nvidia.com/content/PDF/kepler/nvidia-
gpu-boost-tesla-k40-06767-001-v02.pdf
4.2 Cooling Systems
Our water-cooling system allowed us to operate the
GPU at lower die temperatures under load than is
possible with the stock fan. Overall power efficiency of
a GPU-based HPC system depends also on the power
consumed by cooling subsystems. Our water-based sys-
tem used less power than the chassis’ stock fans, so
in our simple case overall power efficiency increased.
In larger installations, Januszewski et al. report that
water-based cooling systems can reduce the total power
consumed by a server room by more than 15% [7].
Warm water-based cooling techniques show great promise;
an IBM Aquasar system demonstrated an exergetic ef-
ficiency increase of 34% through use of warm water
(60 °C) cooling [23]. However, power consumption of
electronics increased by 7 ± 1% as the coolant temperature
increased from 30 °C to 60 °C.
If we modify Eq. 5 to include the power required
for cooling Pcool and other infrastructure sources, we
instead wish to optimize
$$\eta_{\rm pow}(V, f, T) = \frac{N_{\rm OPS}(V, f, T)}{P_{\rm sys}(V, f, T) + P_{\rm cool}(T) + \ldots}, \qquad (7)$$
where the denominator is the sum of the power over the
entire system. Here, we have explicitly written Psys and
Pcool as functions of temperature. Using Eq. 7 as a ba-
sis for finding optimal power efficiency for a given code
differs from past techniques as it considers the system
as a whole, with regard to the fundamental physics
that governs power usage of the underlying microarchi-
tecture.
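The trade-off expressed by Eq. 7 can be illustrated with a toy calculation. In the sketch below, the GPU power uses the 705 MHz fit from Section 3.2 and the 705 MHz benchmark from Table 2, but the cooling power curve is entirely hypothetical and is included only to show how an optimum temperature can arise; it does not describe the cooling system used in this work.

```python
import math

A_FIT, B_FIT, C_705 = 1.00, 3209.7, 148.9  # fit values quoted in Section 3.2
GFLOPS_705 = 2064.0                        # Table 2 benchmark at 705 MHz


def gpu_power_w(temp_c):
    """GPU power at 705 MHz from the fitted temperature model."""
    temp_k = temp_c + 273.15
    return C_705 + A_FIT * temp_k**2 * math.exp(-B_FIT / temp_k)


def cooling_power_w(temp_c, ambient_c=20.0, scale_w_k=400.0):
    """HYPOTHETICAL cooling cost: more power to hold the die nearer ambient."""
    return scale_w_k / (temp_c - ambient_c)


def system_efficiency(temp_c):
    """Eq. 7 with only the GPU and cooling terms in the denominator."""
    return GFLOPS_705 / (gpu_power_w(temp_c) + cooling_power_w(temp_c))


candidates = range(30, 95, 5)  # die temperatures in deg C
best_temp_c = max(candidates, key=system_efficiency)
print(best_temp_c, round(system_efficiency(best_temp_c), 2))
```

With this assumed cooling curve the best temperature lies inside the scanned range rather than at either end; with a different cooling model the optimum would move, which is the point of optimizing the whole denominator of Eq. 7.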
A novel aspect of Eq. 7 is that it predicts that lower-
ing temperature may lead to increased power efficiency,
which appears somewhat in conflict with previous findings
that report lower data center energy consumption
at higher temperatures. There are two main reasons
this discrepancy arises. Firstly, general-purpose data
centers focus on optimizing power usage effectiveness
(PUE), as opposed to performance-per-watt, which is
of more interest to HPC systems. PUE is defined as
the ratio of total facility energy (data center’s total en-
ergy usage) to IT equipment energy (sum of all comput-
ing, storage and network equipment energy usage). Un-
like performance-per-watt, PUE does not directly con-
sider the computational performance of a system. Sec-
ondly, previous comparisons between cooling methods
do not account for temperature-dependent optimization
of supply voltage and clock frequency.
5 Conclusions
One of the main challenges facing exascale HPC is dramatically
reducing the power usage of large HPC systems.
We have shown that temperature-aware optimization
of core clock frequency and supply voltage can increase
the performance-per-watt of a GPU code by up to 48% on an
NVIDIA Tesla K20, achieved by increasing the GPU
clock frequency and decreasing supply voltage while
maintaining a die temperature of 30 °C.
It is taken for granted that code must be optimized
for different architectures in order to fairly compare
compute performance. In contrast, when optimizing power
efficiency for HPC systems, the effect of temperature
upon optimal GPU core frequency and voltage is gener-
ally not considered. Temperature-aware and application-
dependent frequency and voltage scaling (T-DVFS and
A-DVFS) may provide a mechanism with which to in-
crease the power efficiency of GPUs for HPC, by auto-
matically tuning frequency and voltage with considera-
tion of both application code and thermal environment.
References
1. Broekema, P.C., van Nieuwpoort, R.V., Bal, H.E.:
ExaScale high performance computing in the
square kilometer array. In: workshop on High-
Performance Computing for Astronomy Data, p. 9.
ACM Press, New York, New York, USA (2012)
2. Clark, M.A., Plante, P.L., Greenhill, L.J.: Acceler-
ating radio astronomy cross-correlation with graph-
ics processing units. Int. J. of High Performance
Computing Applications 27(2), 178–192 (2013)
3. Collange, S., Defour, D., Tisserand, A.: Power
Consumption of GPUs from a Software Perspec-
tive. In: Computational Science–ICCS, pp. 914–
923. Springer, Berlin, Heidelberg (2009)
4. Ellingson, S.W., Taylor, G.B., Craig, J., et al.: The
LWA1 Radio Telescope. Antennas and Propaga-
tion, IEEE Trans. on 61(5), 2540–2549 (2013)
5. Ge, R., Vogt, R., Majumder, J., Alam, A., et al.:
Effects of Dynamic Voltage and Frequency Scaling
on a K20 GPU. In: Parallel Processing (ICPP),
42nd Int. Conf. on, pp. 826–833 (2013)
6. Hong, S., Kim, H.: An integrated GPU power and
performance model, vol. 38. ACM, New York, USA
(2010)
7. Januszewski, R., Gilly, L., Yilmaz, E., Auweter,
A.: Cooling–making efficient choices. Tech.
rep., Partnership for Advanced Computing in Eu-
rope (2013)
8. Jiao, Y., Lin, H., Balaji, P., Feng, W.: Power and
Performance Characterization of Computational
Kernels on the GPU. In: Green Computing and
Communications (GreenCom), IEEE/ACM Int’l
Conf. on, pp. 221–228 (2010)
9. Kasichayanula, K., Terpstra, D., Luszczek, P., et al.:
Power Aware Computing on GPUs. Application
Accelerators in High Performance Computing
(SAAHPC), Symp. on pp. 64–73 (2012)
10. Kocz, J., Greenhill, L.J., Barsdell, B.R., et al.: A
Scalable Hybrid FPGA/GPU FX Correlator. J.
of Astronomical Instrumentation 03(01), 1450002
(2014)
11. Leng, J., Zu, Y., Reddi, V.J.: Energy efficiency benefits
of reducing the voltage guardband on the Kepler
GPU architecture. In: 10th IEEE Workshop on
Silicon Errors in Logic - System Effects (2014)
12. Liao, W., He, L.: Coupled Power and Thermal
Simulation with Active Cooling. In: Power-Aware
Computer Systems, pp. 148–163. Springer, Berlin,
Heidelberg (2005)
13. Liao, W., He, L., Lepak, K.M.: Temperature and
supply Voltage aware performance and power mod-
eling at microarchitecture level. Computer-Aided
Design of Integrated Circuits and Systems, IEEE
Trans. on 24(7), 1042–1053 (2005)
14. Liu, Y., Dick, R.P., Shang, L., Yang, H.: Accurate
Temperature-Dependent Integrated Circuit Leak-
age Power Estimation is Easy. In: Design, Automa-
tion & Test in Europe Conf. & Exhibition, DATE
’07, pp. 1–6 (2007)
15. Magro, A., Adami, K.Z., Ord, S.: Suitability
of NVIDIA GPUs for SKA1-Low. arXiv.org
1407.4698v3 (2014)
16. Mei, X., Yung, L.S., Zhao, K., Chu, X.: A measure-
ment study of GPU DVFS on energy conservation.
In: Workshop on Power-Aware Computing and Sys-
tems, pp. 1–5. ACM Press, New York, USA (2013)
17. Nagasaka, H., Maruyama, N., Nukada, A., et al.:
Statistical power modeling of GPU kernels using
performance counters. In: Int. Conf. on Green Com-
puting (Green Comp), pp. 115–122. IEEE (2010)
18. Nugteren, C., van den Braak, G.J., Corporaal, H.:
Roofline-aware DVFS for GPUs. In: Int. Workshop,
pp. 8–10. ACM Press, New York, New York, USA
(2014)
19. Perley, R., Napier, P., Jackson, J., Butler, B., et al.:
The Expanded Very Large Array. Proc. of the IEEE
97(8), 1448–1462 (2009)
20. Ren, D.Q., Suda, R.: Investigation on the power ef-
ficiency of multi-core and GPU Processing Element
in large scale SIMD computation with CUDA. In:
Int. Conf. on Green Computing (Green Comp), pp.
309–316. IEEE (2010)
21. Rofouei, M., Stathopoulos, T., Ryffel, S.: Energy-
aware high performance computing with graphic
processing units. In: Conf. on Power aware com-
puting and systems (2008)
22. Wootten, A., Thompson, A.: The Atacama Large
Millimeter/Submillimeter Array. Proc. of the IEEE
97(8), 1463–1471 (2009)
23. Zimmermann, S., Meijer, I., Tiwari, M.K., Paredes,
S., et al.: Aquasar: A hot water cooled data center
with direct energy reuse. Energy 43(1), 237–245
(2012)
