Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on
  GPUs for Real-time Edge Computing by Adámek, Karel et al.
Karel Adámek1,2, Jan Novotný1,3, Jeyarajan Thiyagalingam4, and Wesley Armour∗1
1Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble
road, Oxford, OX1 3QG, United Kingdom
2Faculty of Information Technology, Czech Technical University, Thákurova 9, 160 00, Prague,
Czech Republic
3Research Centre for Theoretical Physics and Astrophysics, Institute of Physics, Silesian University
in Opava, Bezručovo náměstí 13, CZ-74601, Opava, Czech Republic
4Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Campus,
Didcot, OX11 0QX, UK
September 15, 2020
Abstract
The Square Kilometre Array (SKA) is an international
initiative for developing the world’s largest radio telescope
with a total collecting area of over a million square me-
ters. The scale of the operation, combined with the re-
mote location of the telescope, requires the use of energy-
efficient computational algorithms. This, along with the
extreme data rates that will be produced by the SKA and
the requirement for a real-time observing capability, neces-
sitates in-situ data processing in an edge style computing
solution. More generally, energy efficiency is becoming a
paramount concern in the modern computing landscape,
whether it be the power budget that can limit some of
the world's largest supercomputers or the limited power
available to the smallest Internet-of-Things devices. In
this paper, we study the impact of hardware frequency
scaling on the energy consumption and execution time of
the Fast Fourier Transform (FFT) on NVIDIA GPUs us-
ing the cuFFT library. The FFT is used in many areas of
science and it is one of the key algorithms used in radio as-
tronomy data processing pipelines. Through the use of fre-
quency scaling, we show that we can lower the power con-
sumption of the NVIDIA V100 GPU when computing the
FFT by up to 60% compared to the boost clock frequency,
with less than a 10% increase in the execution time. Fur-
thermore, using one common core clock frequency for all
tested FFT lengths, we show on average a 50% reduction
in power consumption compared to the boost core clock
frequency with an increase in the execution time still
below 10%. We demonstrate how these results can be used
to lower the power consumption of existing data
processing pipelines. Over years of operation, these
reductions can yield significant financial savings and
can also lead to a significant reduction of greenhouse gas
emissions.
∗E-mail address: wes.armour@oerc.ox.ac.uk
Keywords — Energy efficiency, Green computing,
High performance computing, Real-time systems, Paral-
lel architectures
1 Introduction
The Fast Fourier Transform (FFT) is one of the most fun-
damental and widely used numerical algorithms in scien-
tific computing, with applications in a diverse range of ar-
eas such as astronomy, image processing, audio and radar
signal processing, numerical solvers (such as partial
differential equation solvers), and mechanical systems [8]. The FFT is also
an integral part of many data processing pipelines. For in-
stance, the FFT is an important part of data processing
pipelines in both image- [38, 29, 39, 14] and time-domain
[12, 4, 3, 22] radio astronomy.
The upcoming, next-generation radio telescope, the
Square Kilometre Array (SKA), will employ such complex
data processing pipelines to deliver science products that
will provide new and exciting insights into our Universe.
Previous studies, for example [11], estimate that the
SKA will require an exascale high performance
computing (HPC) system to provide us with such scientific
products. Depending on the data processing task, the FFT
may account for up to 47% [20] of the overall computational
footprint, measured in floating-point operations per second
(FLOPS). This makes the FFT a critical algorithm for the SKA.
Processing the data captured by the SKA poses many
challenges. The SKA will produce extremely large
volumes of data at unprecedented rates. Furthermore, the
telescope itself must be located in a radio-quiet area due
to its extreme sensitivity. This makes the persistent
storage of all data not viable and the transportation of these
data to a well equipped (and suitably powered) data centre
impractical. Finally, some science cases, such as the study of
Fast Radio Bursts (FRBs), necessitate near real-time data
processing, meaning that data has to be processed close
to the instrument itself. These constraints present
significant challenges to software and system engineers: they
demand high fractions of peak performance of the hardware
whilst maintaining the best possible energy efficiency of
both software and hardware.
arXiv:2009.06009v1 [cs.PF] 13 Sep 2020
To address the need to minimise the power consumption
of the locally installed hardware, close attention must be
paid to the energy efficiency of the data processing algo-
rithms, specifically the FFT. Given the emphasis on lower
power consumption in HPC in general, the ability to com-
pute the FFT more efficiently is of interest to many com-
putational domains.
The near real-time processing constraint means that the
execution time of the data processing algorithms must not
be increased significantly. An increase in the execution
time might lead to either failure to process data on time
and hence a loss of scientifically important data or in-
creased capital and operational costs as more hardware
would be needed to meet the real-time requirement.
Motivated by this, we have studied the impact of dy-
namic frequency scaling (DFS) on the energy efficiency
and execution time of the FFT on NVIDIA GPUs using
the cuFFT library [27]. The GPU is the fastest and most
energy efficient choice of hardware for image domain
radio astronomy as shown by [40], with FPGAs a close
second. There are other FFT libraries for GPUs, notably the
clFFT library, which uses the OpenCL framework. clFFT
is not a vendor-supported library and was shown by [34]
to be slower than cuFFT on NVIDIA GPUs; thus we have
not considered it for this work.
Our exhaustive study, conducted on a range of
state-of-the-art GPUs, shows that careful tuning of the core clock
frequency can save, in the case of the V100 GPU, up to
60% of the energy consumption of the FFT relative to the
boost core clock frequency. This saving can have a significant
impact on two fronts: financial savings in recurrent costs, and the
associated reduction in CO2 emissions. We also show that these
carefully tuned frequencies can be replaced with a single
frequency that is specific to each model of GPU and
chosen floating-point precision, whilst still saving
on average up to 50% of the FFT energy consumption (for
the V100 GPU, relative to the boost core clock frequency).
The main contributions of this work are:
• We have performed an in-depth investigation of
the cuFFT library's power consumption and execution
time and how they change with core clock frequency for
a wide range of problem sizes and numerical precisions
(FP16, FP32 and FP64) on five NVIDIA GPUs.
• We have identified an optimal core clock frequency with the
highest energy efficiency for all problem sizes and
numerical precisions, and have shown that a single mean
optimal frequency per GPU model gives similar power
savings regardless of problem size.
• We demonstrate how these results can be used to
lower the power consumption of existing data pro-
cessing pipelines.
Whilst this work has been motivated by the SKA radio
telescope, the conclusions of the work are applicable to
any computational task that employs cuFFT running on
NVIDIA GPUs.
2 Background
Power consumption in HPC is being addressed on multiple
levels, from construction at the level of the cluster to
new energy efficient hardware. The power consumption
of specific hardware depends on execution time, the time
taken to finish a calculation, and also on the utilization
of the hardware (memory, cache, computing cores). The
software itself also plays an important role in power con-
sumption. Energy can be saved through proper software
design, making software stable [25] and through the use of
appropriate algorithms.
However, concerns regarding energy efficiency in the
modern computing landscape are not solely limited to
HPC. Edge computing is becoming an increasingly im-
portant research area driven by the explosion of Internet-
of-Things devices. The basic premise of edge computing
is to capture and process data as close to their sources as
is possible by utilising light weight processors. Because
edge computing aims to process data locally, it minimizes
wider latency and bandwidth needs and allows for
real-time feedback. It is estimated that by 2025 around 150
billion devices will be connected and creating data in
real-time [31], with the FFT playing an important role
not only in the communication between devices but also in
processing collected data. Hence optimising the energy
efficiency of the FFT on edge devices is of importance from
an environmental perspective. This has motivated us to
include NVIDIA’s Jetson Nano in our selection of hard-
ware since it represents NVIDIA’s low power edge com-
puting solution.
The idea behind DFS, which is part of the dynamic
voltage and frequency scaling (DVFS) method, is to make
hardware more energy efficient under different loads by
adjusting hardware performance, which is achieved by
changing clock frequencies to fit the application running on it.
By decreasing the clock frequency of a component we
decrease its performance while increasing its utilization,
thus decreasing the power consumption of that
component. For example, Trefethen et al. [37] have investigated
possible energy savings when running software on CPUs
with a different number of threads, compilers and CPU
clock frequencies.
Applications can be broadly separated into two classes
according to what limits their performance. The first is where
an application or algorithm is compute-bound, meaning that
the performance bottleneck of the application is the compute
resource. This can be the number of floating-point operations
which can be performed per second (FLOPS), but also the number of
instructions which can be issued per second. The second
broad category is memory bandwidth bound applications,
where we have enough compute resources but we cannot
supply the data through the memory bus to the computing
cores quickly enough. In this case the performance is
limited by the memory bandwidth. This bandwidth
limitation can occur at any level in the computer's memory
hierarchy, for example at the level of access
to the GPU main memory (called device memory), or at
the level of one of the caches.
We have investigated the cuFFT library using the
NVIDIA Visual Profiler (NVVP). This shows that, for
all investigated problem sizes, the GPU kernels used by the
cuFFT library are device memory bandwidth bound.
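A rough plausibility check of this profiling result can be sketched by comparing the arithmetic intensity of an FFT with the ratio of peak compute to peak memory bandwidth of a device. The sketch below assumes the conventional count of 5N log2(N) floating-point operations per complex FFT and an idealised single read and single write of every element. The peak figures `PEAK_FLOPS` and `PEAK_BANDWIDTH` are illustrative placeholders, not values taken from this work, and the function names are hypothetical.

```python
import math


def fft_arithmetic_intensity(n, bytes_per_element):
    """FLOPs per byte of device memory traffic for one length-n complex FFT.

    Assumes 5 n log2(n) FLOPs and one read plus one write of every element.
    """
    flops = 5.0 * n * math.log2(n)
    traffic_bytes = 2.0 * n * bytes_per_element
    return flops / traffic_bytes


def is_memory_bound(n, bytes_per_element, peak_flops, peak_bandwidth):
    """True when the FFT needs fewer FLOPs per byte than the device can supply."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs available per byte moved
    return fft_arithmetic_intensity(n, bytes_per_element) < machine_balance


# Illustrative, assumed hardware figures (placeholders, not from the paper).
PEAK_FLOPS = 14.0e12      # FLOP/s
PEAK_BANDWIDTH = 900.0e9  # bytes/s

# A single precision complex element occupies 8 bytes.
intensity = fft_arithmetic_intensity(16384, 8)
memory_bound = is_memory_bound(16384, 8, PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"intensity = {intensity:.3f} FLOP/byte, memory bound: {memory_bound}")
```

Under these assumptions the intensity grows only logarithmically with N, so the FFT stays well below the machine balance of such a device for all practical lengths.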
2.1 FFT algorithm
The one-dimensional discrete Fourier transformation
(DFT) of a signal x is given by
$$X_l = \sum_{n=0}^{N-1} x_n \exp\left[-\frac{i 2\pi n l}{N}\right], \tag{1}$$
where Xl is the l-th element of a transformed signal, xn
is the n-th element of an input signal, and N is the trans-
formation length or the FFT length.
The cuFFT library [27] uses the Cooley-Tukey
algorithm [17] for FFT sizes that can be decomposed into
products of powers of primes from 2 to 127, and Bluestein's
algorithm [6] otherwise. For longer FFT lengths the cuFFT
library uses multiple GPU kernels to compute the entire
FFT, which can be seen by studying the cuFFT library
using the NVVP. In many cases, the Fourier transform is
calculated more quickly if the FFT length is increased by
padding to a more optimized length, as was shown by [35].
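To make the relation between the DFT definition (1) and the Cooley-Tukey approach concrete, the following minimal sketch evaluates (1) directly and compares it with a textbook recursive radix-2 FFT. It is a pure Python illustration and makes no claim about how cuFFT implements its kernels. The function names `dft` and `fft_radix2` are hypothetical.

```python
import cmath


def dft(x):
    """Direct evaluation of Eq. (1); costs O(N^2) operations."""
    n_len = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * l / n_len) for n in range(n_len))
        for l in range(n_len)
    ]


def fft_radix2(x):
    """Textbook recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n_len = len(x)
    if n_len == 1:
        return [complex(x[0])]
    if n_len % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n_len
    for l in range(n_len // 2):
        twiddle = cmath.exp(-2j * cmath.pi * l / n_len) * odd[l]
        out[l] = even[l] + twiddle
        out[l + n_len // 2] = even[l] - twiddle
    return out


signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
direct = dft(signal)
fast = fft_radix2(signal)
max_diff = max(abs(a - b) for a, b in zip(direct, fast))
print(f"largest difference between DFT and FFT: {max_diff:.2e}")
```

Both routines return the same spectrum up to rounding error; the recursive version needs O(N log N) operations instead of O(N^2).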
The two-dimensional Fourier transformation is given by
the formula
$$X_{l,k} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x_{n,m} \exp\left[-i 2\pi \left(\frac{nl}{N} + \frac{mk}{M}\right)\right], \tag{2}$$
where xn,m and Xl,k are now elements of matrices of size
N × M. The sums in this equation can be evaluated
independently which allows us to decompose the two-
dimensional Fourier transformation into two sets of one-
dimensional Fourier transformations. This is routinely
done and it is indeed what cuFFT does when calculating
higher-dimensional (2D, 3D) Fourier transformations as
shown by the NVVP. Thus by investigating the energy ef-
ficiency of the one-dimensional Fourier transformation we
are also investigating the energy efficiency of the higher-
dimensional Fourier transforms.
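The separability argument can be checked numerically. The sketch below evaluates the double sum of (2) directly and compares it with one-dimensional transforms applied first along m and then along n. It uses a naive O(N^2) DFT in pure Python for clarity; the helper names are hypothetical and this is not the cuFFT implementation.

```python
import cmath


def dft(x):
    """One-dimensional DFT, Eq. (1), evaluated directly."""
    n_len = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * l / n_len) for n in range(n_len))
        for l in range(n_len)
    ]


def dft2_direct(a):
    """Direct evaluation of the double sum in Eq. (2); a[n][m] has size N x M."""
    n_len, m_len = len(a), len(a[0])
    return [
        [
            sum(
                a[n][m] * cmath.exp(-2j * cmath.pi * (n * l / n_len + m * k / m_len))
                for n in range(n_len)
                for m in range(m_len)
            )
            for k in range(m_len)
        ]
        for l in range(n_len)
    ]


def dft2_separable(a):
    """Same result from two sets of one-dimensional transforms."""
    rows = [dft(row) for row in a]                 # transform over m for every n
    cols = [dft(list(col)) for col in zip(*rows)]  # transform over n for every k
    return [list(r) for r in zip(*cols)]           # transpose back to [l][k]


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # N = 2, M = 3
direct2 = dft2_direct(data)
separable2 = dft2_separable(data)
max_diff_2d = max(
    abs(p - q) for row_p, row_q in zip(direct2, separable2) for p, q in zip(row_p, row_q)
)
print(f"largest difference: {max_diff_2d:.2e}")
```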
2.2 GPU architecture
The GPU design methodology is different to that of a
CPU. A CPU architecture is aimed at low latency
computations, at the cost of lower throughput. In other words, the
CPU can execute a wider range of complicated algorithms
quickly, for example complicated branching code, but the
number of concurrently running tasks is small. A GPU
architecture has high latency but also high throughput: on a
GPU one can execute thousands of simple tasks, but each
task takes longer to process due to the simpler schedulers
that are employed. Both platforms are broadening their
focus: CPUs are adding more cores and increasing their
vector lengths, while GPU architectures are becoming more
complex and GPU schedulers more sophisticated.
[Figure 1 diagram: a memory block containing the device memory, and a computing block containing the L2 cache and several SMs, each with its own L1 cache.]
Figure 1: A schematic of the GPU architecture.
The GPU architecture, which is, in simplified form,
shown in Fig. 1, is divided into the memory block and
the compute block. The compute block is further divided
into caches and streaming multiprocessors (SM) which
are responsible for executing the computations. The SMs
are further divided into specialized units such as floating-
point cores or special function units (which are respon-
sible for computing things like transcendental functions).
The memory hierarchy on the GPU is distributed between
these two blocks. The device memory which runs at the
memory clock frequency has the lowest bandwidth on the
GPU card and it is the memory that the CPU (host) can
read/write into via the PCIe bus. The L2 cache is shared
between the SMs, the L1 cache is private to each SM and
the shared memory is shared amongst a group of threads
called a threadblock. The L2, L1 and shared memory
bandwidth is proportional to the core clock frequency, thus
by using a lower core clock frequency we also decrease the
bandwidth of these caches. The core clock frequency, as
well as the memory clock frequency, can only be set to
predefined values.
Different GPUs may use different memory modules.
Amongst the tested GPUs were GPUs with GDDR mem-
ory modules (Titan XP, P4, Jetson Nano) which allow
us to change the memory clock frequency, but also GPUs
with HBM2 modules (Titan V, V100) which do not allow
us to change the memory clock frequency.
When measuring the power consumption and
performance of the GPU it is important to keep the GPU
utilized. For example, the NVIDIA V100 GPU has 80
streaming multiprocessors (SMs), where each SM is able to
run up to 2048 threads. This gives more than 150
thousand threads (80 × 2048 = 163,840) which can execute
concurrently. Thus in our measurements, we have used a
fixed amount of data containing a different number of
individual Fourier transforms to keep the GPU utilized for
all tested FFT lengths.
2.3 Real-time processing
Data processing can be composed of a single step but more
often is a series of processing steps which together form a
data processing pipeline.
The ability of the application to process data in a real-
time processing scenario can be described by the real-time
speed-up factor. The real-time speed-up is calculated as
S = ta/tp, where ta is the time needed to acquire a given
amount of data by the telescope, sensor, etc. and tp is the
time taken to process that data. When S ≥ 1 the pipeline
is processing data in real-time or quicker and when S < 1
the pipeline is not managing to process data in real-time.
If we assume that our toy pipeline has a real-time speed-up
factor of S = 1 that pipeline is processing the data in time
but has no performance buffer to call on if needed. In such
a case any increase in the execution time leads to S < 1
and in order to process data in real-time again we must
add more hardware to share the processing load. This
situation is however unrealistic and a real-world pipeline
would have a performance buffer to call on in the case of an
unexpected event. We must also keep in mind that an
increase in hardware does not necessarily equate to the same
increase in a pipeline's performance. The parallelization of
a given task might be non-trivial; for example,
communication between GPUs could be a limiting factor. In our
case the approximation that performance scales with the
amount of hardware is appropriate, as Fourier
transformations which can fit into the memory of the GPU can
be easily distributed amongst the GPUs.
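The real-time speed-up factor and its consequence for hardware provisioning can be sketched as follows. The sketch assumes the ideal scaling discussed above, that is, the processing time divides evenly across identical devices. The function names and the timing values are hypothetical illustrations, not measurements from this work.

```python
import math


def realtime_speedup(t_acquire, t_process):
    """S = t_a / t_p; S >= 1 means the pipeline keeps up with acquisition."""
    return t_acquire / t_process


def devices_needed(t_acquire, t_process):
    """Smallest number of identical devices giving S >= 1 under ideal scaling."""
    return max(1, math.ceil(t_process / t_acquire))


# Hypothetical pipeline: 1.0 s of data is processed in 0.8 s at the boost clock.
s_boost = realtime_speedup(1.0, 0.8)
# A 10% longer execution time at a lower clock still leaves S above one.
s_tuned = realtime_speedup(1.0, 0.8 * 1.10)
# A pipeline with no performance buffer needs a second device after a 10% slowdown.
extra = devices_needed(1.0, 1.0 * 1.10)
print(f"S(boost) = {s_boost:.3f}, S(tuned) = {s_tuned:.3f}, devices = {extra}")
```

The example illustrates both situations considered in this work: a pipeline with spare performance can absorb the slowdown, whereas one running at S = 1 needs additional hardware.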
In this work we consider two situations. The first is
where the real-time processing pipeline exists and where
the spare performance can be used to increase the energy
efficiency of the pipeline. In the second case, we are in-
terested in how much additional hardware is needed to
process data in real-time at the best energy efficiency.
3 Related Work
As of November 2019, the first two positions in the
TOP500 list of supercomputers are held by systems that use
GPUs. Within the top ten, five systems contain GPUs.
In the Green500 list, GPUs are used in eight out of the
top ten supercomputers. This clearly demonstrates that it
is important to understand the power consumption,
energy efficiency and potential energy savings for GPUs
using DVFS.
The different approaches to measuring the power
consumption, power and performance modelling, and also
the results of DVFS for selected applications were reviewed
by Mei et al. [24]. The authors note that the effect of
DVFS depends not only on the architecture but also on
the characteristics of the GPU application. They have
found the optimal frequency for 42 GPU applications and
found that 12 of them benefited from an increased core
frequency compared to the default whereas for 30 appli-
cations the optimal frequency was lower than the default
core frequency, and values of these optimal frequencies
were different for most GPU applications. The authors
called for a deeper investigation into their differences. A
useful review of the DVFS technique is provided by Mittal
and Vetter [26]. The review by Bridges et al. [7] looked
into the modelling of the power consumption by GPUs.
A number of published studies have investigated the
reliability of power measurements using internal sensors.
Burtscher et al. [9] published their experience of using
built-in sensors when measuring the power consumption
of NVIDIA K20 GPUs. They described several issues that
they encountered when using these sensors and suggested
methods to correct for these. The accuracy of the built-in
sensors was investigated by Farad et al. [13] who found
that the average mean error using an abstract model of
a GPU is about 10% compared to measurements using
external power meters. This error value was confirmed
by Arafa et al. [5] who measured the energy consump-
tion of almost all PTX instructions for four generations
of NVIDIA GPUs. They have found that the Maxwell
and the Turing generations of GPUs have high energy
consumption when compared to the Pascal and the Volta
generations of NVIDIA GPUs which are found to be more
energy efficient.
There are a number of papers where authors have used
DVFS in the context of GPUs [2, 33, 41, 23, 15, 10, 21,
16, 18, 24, 36]. Guerreiro et al. [16] classified GPU
applications into four different categories which describe their
behaviour when DVFS is applied. These categories are
an extension of the compute-bound, memory-bound dis-
tinction. The early work on GPU power consumption and
DVFS was performed by Jiao et al. [18]. They investi-
gated the behaviour of several GPU applications which
included the FFT algorithm, however, the cuFFT library
was not studied because there were better performing FFT
implementations at the time. The FFT was also indirectly
included in Mei et al. [24] as part of the convolution, and
in Tang et al. [36] where the author investigated the effect
of DVFS on deep learning applications.
In relation to radio astronomy and the SKA, there are
several works. Price et al. [30] made a detailed inves-
tigation into power consumption, voltage and frequency
scaling of the GPU implementation of the correlator for
the SKA. The power consumed by the GPU in the do-
main of radio astronomy was investigated by Romein [32].
The performance of the cuFFT library was investigated
by Jondra et al. [19] along with its power consumption.
However, increases in energy efficiency due to frequency
scaling were not investigated.
4 Experimental Setup and Evaluation
The code that we have used for measurements of the
energy efficiency of the FFT algorithm (it can be found at
https://github.com/KAdamek/cuFFT_benchmark) consists of a basic
implementation of the NVIDIA cuFFT library [27].
The code first generates input data as pseudo random
numbers on the host and then we transfer the data from
the host to the device via the PCIe bus. The code runs
the FFT algorithm on the GPU multiple times whilst the
power used by the GPU is measured as described below.
The measurements gained from multiple runs are used to
calculate a relative standard deviation which we use to
represent the measurement error in the results presented.
We provided the GPU with enough data to ensure that it is
fully utilized. The Fourier transform used was an out-of-
place one-dimensional transform as provided by cuFFT.
When the FFT algorithm ends the measurement of the
power is stopped. Thus only the power consumption of
the FFT algorithm on the GPU is measured. The calcu-
lated result is transferred back to the host. The result is
then compared to the result from the same transforma-
tion again performed by the GPU, but this time using the
GPU’s default settings. This is done to ensure correctness.
To achieve this in practice we log
the timestamp, power consumption, current core clock
frequency and current memory clock frequency. For
that we use the NVIDIA System Management Interface
(nvidia-smi) for all GPU cards except the Jetson Nano,
where we have used the tegrastats utility. For both we
have specified the measurement interval to be 10 ms, as our
tests have shown that a sampling interval below
10 ms does not lead to an improvement in the time
resolution of our data. The actual time between samples varied,
and the sampling interval achieved from the driver is
on average 14.2 ms for all tested FFT lengths and cards.
This fulfills the criterion of a sampling interval of at most
15 ms (a sampling rate of at least 66.7 Hz) recommended by
Burtscher et al. [9] to accurately measure the energy
consumption of real-world kernels.
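The check of the achieved sampling interval can be sketched as below. The log layout (timestamp, power, core clock, memory clock, comma separated) is an assumption modelled on what a query such as `nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,clocks.mem --format=csv,noheader,nounits -lms 10` would produce; the exact command used for this work is not stated here. The sample values in `SAMPLE_LOG` are synthetic and the function names are hypothetical.

```python
from datetime import datetime

# Synthetic log lines in the assumed layout:
# timestamp, power [W], core clock [MHz], memory clock [MHz]
SAMPLE_LOG = """\
2020/09/15 12:00:00.000, 41.20, 1020, 877
2020/09/15 12:00:00.014, 118.75, 1020, 877
2020/09/15 12:00:00.029, 121.10, 1020, 877
2020/09/15 12:00:00.043, 120.40, 1020, 877
"""


def parse_log(text):
    """Return a list of (timestamp, power, core clock, memory clock) tuples."""
    samples = []
    for line in text.strip().splitlines():
        stamp, power, core, mem = [field.strip() for field in line.split(",")]
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f")
        samples.append((when, float(power), int(core), int(mem)))
    return samples


def mean_interval_ms(samples):
    """Average time between consecutive samples in milliseconds."""
    gaps = [
        (later[0] - earlier[0]).total_seconds() * 1e3
        for earlier, later in zip(samples, samples[1:])
    ]
    return sum(gaps) / len(gaps)


samples = parse_log(SAMPLE_LOG)
interval_ms = mean_interval_ms(samples)
meets_criterion = interval_ms <= 15.0  # at most 15 ms, i.e. at least 66.7 Hz
print(f"mean sampling interval: {interval_ms:.2f} ms, ok: {meets_criterion}")
```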
To locate the FFT kernels in the log and to establish
the execution time we have used the nvprof utility,
with timestamps included in its output. We log
the beginning and end of each GPU kernel execution to
a file. This way we produce two files containing all of
the needed metrics for all possible combinations of core
clock frequencies for a specific FFT length, bit precision
and GPU card. These files are then combined, by comparing
timestamps, using a simple
R script. Here we compute all other metrics including
energy efficiency, optimal clock frequency, mean optimal
core clock frequency and computational performance. In
the script we also verify that the current core clock
frequency is the same as the requested one, and compare the
measured execution time from nvprof with the log
timestamps of the nvidia-smi query. Using this method we
have found that, for the Titan V, the core clock frequency
is capped to 1335 MHz by the driver (version 450.36.06)
during the computation, but is set to a higher
core clock frequency (1837 MHz) during the copy of the
results. For frequencies lower
than 1335 MHz, no capping is observed. An example of
the GPU kernel power consumption and active core clock
frequency, which was localized using log file timestamps,
is shown for the V100 GPU in Fig. 2 (top). An example
of the frequency capping on the Titan V GPU is shown in
Fig. 2 (bottom).
[Figure 2 plots: power consumption [W] and core frequency [MHz] versus sample index. Top panel title: Tesla V100, FFT length = 16384, core clock frequency = 1020 MHz. Bottom panel title: Titan V, FFT length = 16384, core frequency = 1020 MHz.]
Figure 2: Parts of the log file with the GPU kernel high-
lighted (red dots) by the R script between the two non-
computing parts of the GPU run (grey line dots) show-
ing the reported power consumption. The blue line corre-
sponds to the measured core clock frequency. Specifically,
the data displayed are from measurements on the Tesla
V100 (top) and Titan V (bottom) for an FFT length of
$2^{14}$, single precision and the core clock frequency set to
1020 MHz (Tesla V100) and 1912 MHz (Titan V).
The choice of clock frequencies for both the memory
bus and the computational cores is limited to a set of
supported frequencies defined by the hardware itself. The
core clock frequency can easily be changed to any supported
value through the driver API. The allowed clock frequencies of the device
memory bus are limited or not changeable depending on
the memory type. Since the cuFFT library is completely
limited by device memory bandwidth this suggests that
lowering the memory frequency would not lead to sub-
stantial increases in the energy efficiency. Thus, we have
not changed the memory clock frequency in this work.
Moreover the High Bandwidth Memory (HBM) which is
present on the newest GPU cards (Titan V, Tesla V100)
operates on a fixed memory clock frequency. The ranges
and step sizes of the core clock frequencies that we have
used are summarized in Table 1.
Table 1: List of the allowed core clock frequencies from
maximal fmax up to minimal fmin frequency for all cards
and their corresponding frequency step size (fstep). The
size of the frequency step alternates between values shown
in the column fstep with the exception of the Jetson Nano.

Card name      fmax [MHz]   fmin [MHz]   fstep [MHz]
Tesla V100     1530         135          7, 8
Tesla P4       1531         455          12, 13
Titan XP       1911         379          12, 13
Titan V        1912         135          7, 8
Jetson Nano    921.6        76.8         76.8

The energy for a specific core clock frequency is defined
as
$$E_f = \sum_i P_i \cdot t_i, \tag{3}$$
where $P_i$ corresponds to the reported power for a sample
index $i$ and $t_i$ is the time between the current sample and
the previous one. Then the energy efficiency for a specific
core clock frequency is given as
$$E_{\mathrm{ef}} = \frac{C_p \cdot t}{E_f}, \tag{4}$$
where $t$ corresponds to the time of the whole run of the
computation, $E_f$ is the energy and $C_p$ is the
computational performance in FLOPS given by
$$C_p = \frac{5 N \log_2(N) \cdot N_b \cdot N_{\mathrm{FFT}}}{t}, \tag{5}$$
where $N_b$ is the number of FFT runs of length $N$ and
$N_{\mathrm{FFT}}$ is the number of FFTs computed per run. The
number of Fourier transforms performed ($N_{\mathrm{FFT}}$) depends
on the FFT size as follows
$$N_{\mathrm{FFT}} = \frac{M_{\mathrm{GB}}}{N \cdot B}, \tag{6}$$
where $M_{\mathrm{GB}}$ is the desired amount of memory used for
FFTs in GB and $B$ is the byte size of the input data
type. The optimal core clock frequency for a specific FFT
length is then found as the one with the minimal consumed
energy.
We define the increase in energy efficiency as
$$I_{\mathrm{ef}} = \frac{E_{\mathrm{ef},o}}{E_{\mathrm{ef},d}}, \tag{7}$$
where $E_{\mathrm{ef},o}$ and $E_{\mathrm{ef},d}$ are the energy efficiencies for the
optimal frequency and the boost frequency respectively
(given by (4)).
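The chain of definitions (3) to (7) can be traced with a short sketch. All power samples and timings below are synthetic and chosen only to make the arithmetic easy to follow; the function names are hypothetical, and the memory budget is expressed directly in bytes rather than GB to keep the units of (6) explicit.

```python
import math


def energy_joules(powers_w, intervals_s):
    """Eq. (3): reported power times the time since the previous sample, summed."""
    return sum(p * dt for p, dt in zip(powers_w, intervals_s))


def number_of_ffts(memory_bytes, n, bytes_per_element):
    """Eq. (6), with the memory budget given in bytes (an assumption of this sketch)."""
    return memory_bytes // (n * bytes_per_element)


def performance_flops(n, n_runs, n_ffts, t_s):
    """Eq. (5): computational performance in FLOPS."""
    return 5.0 * n * math.log2(n) * n_runs * n_ffts / t_s


def energy_efficiency(c_p, t_s, e_f):
    """Eq. (4): floating-point operations per joule."""
    return c_p * t_s / e_f


N = 16384
n_ffts = number_of_ffts(2**30, N, 8)  # 1 GiB of single precision complex data

# Synthetic run at the boost clock: 10 samples of 140 W, 10 ms apart.
e_boost = energy_joules([140.0] * 10, [0.01] * 10)
eff_boost = energy_efficiency(performance_flops(N, 1, n_ffts, 0.10), 0.10, e_boost)

# Synthetic run at a lower clock: 10% slower, but drawing 60 W.
e_tuned = energy_joules([60.0] * 11, [0.01] * 11)
eff_tuned = energy_efficiency(performance_flops(N, 1, n_ffts, 0.11), 0.11, e_tuned)

increase = eff_tuned / eff_boost  # Eq. (7)
print(f"E_boost = {e_boost:.1f} J, E_tuned = {e_tuned:.1f} J, I_ef = {increase:.2f}")
```

Because the same amount of work is done in both runs, the increase in energy efficiency reduces to the ratio of the consumed energies.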
The measurement error, that is the relative standard
deviation, for the V100 GPU and the Jetson Nano is shown
in Fig. 3. We have observed that the measurement error,
in general, is around 5% for all cards except the Jetson
Nano. The GPU cards use instrumentation amplifiers for
the current/voltage/power monitors, hence the potential
error in the measurement is expected to be around 3–5%
[1]. The results of our power measurement correspond to
the expected characteristics of the on-board chips.
For Fourier transformations of higher radices (7+) or
for Fourier transformations which use the Bluestein algo-
rithm we observe a measurement error of up to 5%. The
measurement error increases with decreasing core clock
frequency and increasing number of GPU kernels used for
the FFT calculation.
The measurement error for the Jetson Nano is
usually below 15% for all FFT lengths, and is below 10%
for power-of-two FFT lengths. The highest
measurement error that we have observed is for Bluestein FFT
lengths. For these lengths, cuFFT uses multiple kernels
(for N = 1392 eleven GPU kernels are used) thus the high
measurement error is due to the different loads these GPU
kernels exert on the GPU and also the differing power
consumption between them. The Bluestein FFT lengths
represent a marginal case. Due to large measurement errors
for Bluestein FFT lengths on the Jetson Nano we have not
included these measurements into our calculations of mean
optimal frequency. However, we present these results for
the sake of completeness.
[Figure 3 plots: power measurement error [%] versus measurement number, with separate series for Radix n=2, Radix n>2 and Bluestein FFT lengths. Top panel title: Tesla V100, FP32. Bottom panel title: Jetson Nano, FP32.]
Figure 3: Measurement error (V100 GPU at the top,
Jetson Nano at the bottom) for all tested FFT lengths at all
tested core clock frequencies.
For the measurement of the execution time we have used
the NVIDIA Visual Profiler. Using this we have found that
the measurement error for the execution time was below
0.3%.
Using propagation of uncertainty the error of the en-
ergy (3) is dominated by the measurement error of the
power consumption. Based on that, the error in the in-
crease in energy efficiency (7) is given by
$$\sigma_R(I_{\mathrm{ef}}) = \sqrt{2}\,\sigma_R(E_{\mathrm{ef}}), \tag{8}$$
where σR is the relative error and we have assumed that
the relative errors in Eef,o and Eef,d are equal. This gives
an error for the increase in the energy efficiency of 7% for
all GPUs except the Jetson Nano, where the error is 21%.
These values represent the worst case scenario since most
of the measurement errors are well below these values.
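The factor of √2 follows from standard first-order propagation of uncertainty for a quotient. The short derivation below additionally assumes that the errors of the two efficiencies are uncorrelated, which is an assumption of this sketch rather than something stated explicitly in the text.

```latex
\begin{align}
I_{\mathrm{ef}} &= \frac{E_{\mathrm{ef},o}}{E_{\mathrm{ef},d}}, \\
\sigma_R^2(I_{\mathrm{ef}}) &= \sigma_R^2(E_{\mathrm{ef},o}) + \sigma_R^2(E_{\mathrm{ef},d})
  = 2\,\sigma_R^2(E_{\mathrm{ef}}), \\
\sigma_R(I_{\mathrm{ef}}) &= \sqrt{2}\,\sigma_R(E_{\mathrm{ef}}).
\end{align}
```

With a relative error of 5% in the energy efficiency this gives approximately 7.1%, and with 15% it gives approximately 21.2%, in agreement with the rounded values quoted for the desktop and server GPUs and for the Jetson Nano.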
5 Results
For our investigation, we have used five different NVIDIA
GPUs from three recent architecture generations, namely
V100 (Volta), Tesla P4 (Pascal), Jetson Nano (Maxwell),
Titan V (Volta) and Titan XP (Pascal). The relevant
hardware specifications can be found in Table 2. Both
the V100 GPU and the Tesla P4 GPU are aimed at
scientific applications; the P4 GPU also offers improved energy
efficiency for its generation. The Jetson Nano is a
low-powered all-in-one solution for autonomous systems. The
Titan V and Titan XP are consumer grade GPUs.
GPUs have two different frequency settings: a base and
a boost core clock frequency. If not stated otherwise, we
have used the boost core clock frequencies. This is because
the GPU’s default behaviour is to perform calculations at
the boost core clock frequency. This is indeed what is
observed when the GPU is set to default mode and we run
our cuFFT code. When reporting energy efficiency, we use
both frequencies as there is a non-linear dependency of the
power consumption of a GPU on the core clock frequency.
We have measured the complex-to-complex (C2C)
one-dimensional transform for three different floating-point
precisions: double (FP64), float (FP32) and half (FP16).
The Tesla P4, Titan XP and Jetson Nano GPUs have lim-
ited support for the double precision format. Furthermore,
the Tesla P4 and the Titan XP do not support the half
(FP16) floating-point precision. In addition, when using
half precision (FP16), the cuFFT library supports only
power-of-two FFT lengths.
We have investigated various FFT lengths but focused
on lengths that are powers-of-two because FFT algo-
rithms are not only best suited to processing such lengths,
but also offer superior execution time performance with
powers-of-two lengths. When calculating non-power-of-two
FFT lengths it is often faster [35] to pad the data
to the nearest higher power-of-two FFT length and then
perform the Fourier transform.
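The padding step can be sketched as follows. This is a minimal Python illustration, not the code used for the measurements; the helper names are hypothetical, and zero is assumed as the fill value.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("FFT length must be positive")
    return 1 << (n - 1).bit_length()

def zero_pad(samples: list, fill=0j) -> list:
    """Return a copy of samples padded with `fill` up to a power-of-two length."""
    target = next_power_of_two(len(samples))
    return list(samples) + [fill] * (target - len(samples))

padded = zero_pad([1 + 0j] * 1000)
print(len(padded))  # 1024
```

Padding changes the frequency resolution of the output, so whether it is acceptable depends on the application.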
First, we present execution times for processing a fixed
amount of data, t_fix, which offers an insight into the level
of optimization provided by the cuFFT library. The memory
required to store the data for the Fourier transform
grows linearly with the FFT length N. Since
the cuFFT library is limited by the device memory bandwidth,
the execution time consists of the time t_i required to
transfer the data to the computing cores and to store the
result back to the device memory, and the time t_o required
for any additional overhead accesses to the device memory.
If the performance limiting factor is different to
the device memory bandwidth, we are unable to make
such a distinction in this work. In an ideal case where
we would have a large enough cache, the execution time
of the Fourier transform would be equal to the time t_i.
However, because the cache size is limited, the time t_o
will be non-zero and directly indicates the efficiency of the
implementation. By fixing the amount of memory being
processed, the time t_i is constant and any increase
in the execution time of the Fourier transform is due
to the time t_o.
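The split of the execution time into a transfer part and an overhead part can be illustrated with a short Python sketch. It assumes that the ideal transfer time corresponds to reading the input once and writing the output once at the device memory bandwidth listed in Table 2; the 2 GB data size and the measured time are placeholders for illustration only.

```python
def ideal_transfer_time_ms(data_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the transfer time: read the input once, write the output once."""
    return 2 * data_bytes / bandwidth_bytes_per_s * 1e3

def overhead_time_ms(t_fix_ms: float, t_transfer_ms: float) -> float:
    """Overhead time = total execution time - ideal transfer time."""
    return t_fix_ms - t_transfer_ms

data_bytes = 2 * 1024**3   # assumed 2 GB of input data (binary gigabytes)
bandwidth = 900e9          # Tesla V100 device memory bandwidth from Table 2
t_i = ideal_transfer_time_ms(data_bytes, bandwidth)
t_fix_measured = 10.0      # hypothetical measurement in ms, not a result of this work
print(round(t_i, 2), round(overhead_time_ms(t_fix_measured, t_i), 2))
```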
If we fix the amount of data that is processed then the
number of FFTs performed, N_FFT, depends on the FFT
length as given by (6). The execution time of a single
FFT within a batch is given as t_t = t_fix/N_FFT. The
execution time t_fix for processing a fixed amount of data for
various FFT lengths is shown in Fig. 4 for FP32 and in
Fig. 5 for FP16 and FP64 precision. Because of the low
amount of available memory on the Jetson Nano, its measured
execution time \hat{t}_fix is for 1/4 of the amount of data, so
the comparable value is t_fix = 4 \hat{t}_fix.
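As an illustration of how the batch size follows from the fixed amount of data, the Python sketch below assumes that the number of FFTs is the data size divided by the size of one transform (FFT length times bytes per complex sample). Relation (6) is not repeated here, so this form is an assumption; the function names are hypothetical.

```python
# Bytes per complex sample for each precision (two real values per sample).
BYTES_PER_COMPLEX = {"FP16": 4, "FP32": 8, "FP64": 16}

def ffts_per_batch(data_bytes: int, fft_length: int, precision: str = "FP32") -> int:
    """Number of FFTs of a given length that fit into a fixed amount of data."""
    return data_bytes // (fft_length * BYTES_PER_COMPLEX[precision])

def time_per_fft_ms(t_fix_ms: float, n_fft: int) -> float:
    """Execution time of a single FFT within the batch."""
    return t_fix_ms / n_fft

n_fft = ffts_per_batch(2 * 1024**3, 16384, "FP32")
print(n_fft)  # 16384
```

With 2 GB of FP32 data and N = 16384 this gives a batch of 16384 FFTs.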
The execution time t_fix increases in proportion to the
length of the Fourier transform. However, we see regions
of the same execution time with sudden increases after
specific FFT lengths. These abrupt changes represent a
transition from one optimized GPU kernel to another, as
is shown by the NVIDIA profiler. We must take these
changes into account in our analysis since these GPU
kernels might behave differently. When the execution time
t_fix does not increase for a given range of problem sizes (for
example from FFT length N = 32 to N = 8192) it means
that the higher number of floating-point operations which
comes with a larger problem size utilizes GPU resources
other than the device memory bandwidth. Given that the
Titan XP, Tesla P4 and Jetson Nano GPUs do not fully
support all tested floating-point precisions, the execution time
of Fourier transformations on these GPUs exhibits different
behaviours.
[Figure 4 plot: average time t_fix [ms] (logarithmic axis)
against FFT length [samples] (32 to 1M), complex-to-complex,
FP32. Series: Tesla P4, Titan XP, Titan V, Tesla V100,
Jetson Nano (\hat{t}_fix). Markers distinguish radix n=2,
radix n>2 and Bluestein FFT lengths.]
Figure 4: The execution time t_fix (for FP32) required to
process a fixed amount of data for different FFT lengths.
The discontinuities in the execution time indicate a change
of optimised GPU kernel that is used to calculate the FFT.
Results for the Jetson Nano are for one quarter of the
memory size.
In this work, results are presented per FFT batch, which
is the number of FFTs of a given length which fit into
the fixed amount of memory that we have chosen to work
with. However, most of our results, such as energy
efficiency, are independent of the number of FFTs calculated
provided that the GPU is fully utilised. The execution
time for different core clock frequencies is denoted by t_f.
The execution time with boost frequency is denoted as t_d
and is taken as the execution time for the default settings.
Furthermore, we have focused our discussion on the V100
GPU as it is the most current (and widely used) scientific
GPU and the Jetson Nano as it represents NVIDIA's
low power edge computing solution. We point out any
deviations from these behaviours in the other tested GPUs
when they occur.
Table 2: GPU card specifications. The shared memory bandwidth is calculated as BW (bytes/s) =
(bank bandwidth (bytes)) × (clock frequency (Hz)) × (32 banks) × (# multiprocessors).
                          Titan XP       Tesla P4      Titan V        Tesla V100     Jetson Nano
CUDA cores                3840           2560          5120           5120           128
SMs                       30             20            80             80             2
Base/boost core clock     1405/1480 MHz  810/1063 MHz  1220/1455 MHz  1200/1455 MHz  921 MHz
Memory clock              5005 MHz       3003 MHz      850 MHz        877 MHz        1600 MHz
Device memory bandwidth   547 GB/s       192 GB/s      652 GB/s       900 GB/s       25.6 GB/s
Memory modules            GDDR5          GDDR5         HBM2           HBM2           LPDDR4
Shared memory bandwidth   5395 GB/s      2657 GB/s     14550 GB/s     14550 GB/s     230 GB/s
Memory size               12 GB          8 GB          12 GB          16 GB          4 GB
TDP                       250 W          75 W          250 W          300 W          5/10 W
CUDA version              10.0.130       10.0.130      10.0.130       10.0.130       JetPack 4.2 SDK
[Figure 5 plots: average time t_fix [ms] (logarithmic axis)
against FFT length [samples] (32 to 1M) in two panels,
complex-to-complex FP64 and FP16. Series: Titan V, Tesla P4,
Titan XP, Tesla V100, Jetson Nano (\hat{t}_fix). Markers
distinguish radix n=2, radix n>2 and Bluestein FFT lengths.]
Figure 5: The execution time t_fix (for FP16 and FP64)
required to process a fixed amount of data for different FFT
lengths. The discontinuities in execution time indicate a
change of optimised GPU kernel that is used to calculate
the FFT. Results for the Jetson Nano are for one quarter
of the memory size.
5.1 Frequency Scaling
First, we present the behaviour of the execution time with
changing core clock frequency. This is shown as the ratio of
the execution time t_f to the default execution time t_d in
Fig. 6, which shows all tested configurations for FP32 precision.
There are three distinct behaviours; the execution time
is:
a) decreasing at first;
b) slightly increasing;
c) increasing notably with each frequency decrease.
In the case of the V100 GPU, the first two behaviours
a) and b) are in the majority. For a few specific FFT
lengths (notably for N = 8192) we have observed
behaviour c). We have observed this behaviour throughout
multiple measurements and always for the same FFT
lengths. Other tested GPUs behaved similarly to the V100
GPU.
[Figure 6 plots: execution time ratio (logarithmic axis, 1 to
10) against GPU core clock frequency [MHz], complex-to-complex,
FP32, in two panels: Jetson Nano and Tesla V100. The mean
optimal core clock frequency is marked.]
Figure 6: Ratio of the execution time t_f over the default
execution time t_d measured for the V100 GPU and the
Jetson Nano. Every investigated FFT length is shown
and represented by a single line.
The Jetson Nano exhibits a different behaviour, where
most of the configurations belong to case c) with notable
peaks which are present for Bluestein FFT lengths.
The energy consumed per FFT batch calculated by
equation (3) with fixed length N = 16384 for different
GPUs is shown in Fig. 7. For the measurement, we have
used a batch of 16384 FFTs (in the case of FP32 this rep-
resents 2 GB of input data) in order to fully saturate the
GPU. Notably, the energy per FFT batch on the Titan V
GPU does not change above 1335 MHz. This is because
the card does not run at the user selected frequency but
is capped by the driver to 1335 MHz.
As the core clock frequency decreases, the power
consumption of the GPU changes non-linearly. This is shown
in Fig. 8 for the V100 GPU and Jetson Nano.
The frequency at which the energy per FFT batch
reaches a minimum was selected as the optimal frequency.
The optimal frequency is different for each tested FFT
length for a given GPU and precision. The optimal fre-
quency expressed as a percentage of the default core clock
frequency for all precisions is shown in Fig. 9.
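The selection of the optimal frequency can be sketched in Python as a search for the minimum energy over a frequency sweep. The sketch assumes that the energy per batch is the average power multiplied by the execution time; the numbers in the sweep are hypothetical and are not measurements from this work.

```python
def energy_per_batch(power_w: float, time_s: float) -> float:
    """Energy in joules, assuming energy = average power * execution time."""
    return power_w * time_s

def optimal_frequency(measurements: dict) -> int:
    """measurements maps core clock [MHz] -> (average power [W], execution time [s])."""
    return min(measurements, key=lambda f: energy_per_batch(*measurements[f]))

# Hypothetical sweep for illustration only.
sweep = {
    1455: (240.0, 0.0200),
    1200: (180.0, 0.0202),
    945: (120.0, 0.0208),
    600: (95.0, 0.0320),
}
print(optimal_frequency(sweep))  # 945
```

In this made-up sweep the lowest frequency is not the optimum, because the execution time grows faster than the power falls once the clock drops far enough.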
[Figure 7 plot: energy per FFT batch [J] against GPU core
clock frequency [MHz], complex-to-complex, FP32, FFT
length = 16384. Series: Titan XP, Tesla P4, Titan V,
Tesla V100, Jetson Nano.]
Figure 7: The energy consumed per FFT batch changes
with core clock frequency. The minimum, emphasized by
a black star for each tested GPU, represents the most effi-
cient configuration and the value of the optimal frequency.
[Figure 8 plots: average power consumption [W] against GPU
core clock frequency [MHz], complex-to-complex, FP32, in two
panels: Jetson Nano (0.5 to 4.5 W) and Tesla V100 (50 to
240 W). The mean optimal core clock frequency is marked.]
Figure 8: Averaged power consumption as a function of
core clock frequency for all tested FFT lengths. The Jet-
son Nano is shown independently as its behaviour is differ-
ent from the rest of the tested GPUs which are represented
by the V100 GPU.
[Figure 9 plots: percentage of boost GPU core clock frequency
[%] (30 to 80) against FFT length [samples] (32 to 1M) in
three panels, FP32, FP64 and FP16. Series: Titan V, Tesla V100,
Jetson Nano, Tesla P4, Titan XP.]
Figure 9: Value of the optimal frequency expressed as
a percentage of the boost clock frequency. The value of
the optimal frequency is consistent through different pre-
cisions with the exception of the Tesla P4 GPU.
5.2 Energy savings
To acquire the following results we have selected the op-
timal frequency for each FFT length and measured the
consumed power to calculate the energy efficiency using
equation (4). The energy efficiency expressed as the num-
ber of GFLOPS/W is shown in Fig. 10.
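A rough Python sketch of how a GFLOPS/W figure can be formed is given below. Equation (4) is not repeated here; the sketch assumes the conventional operation count of 5 N log2(N) for a complex FFT of length N, and the timing and power values are placeholders rather than measured results.

```python
import math

def fft_flops(fft_length: int) -> float:
    """Conventional operation count for a complex FFT: 5 N log2(N) (an assumption)."""
    return 5.0 * fft_length * math.log2(fft_length)

def gflops_per_watt(fft_length: int, n_fft: int, time_s: float, power_w: float) -> float:
    """Energy efficiency of a batch of n_fft transforms of the given length."""
    gflops = fft_flops(fft_length) * n_fft / time_s / 1e9
    return gflops / power_w

# Placeholder time (0.02 s) and power (120 W), for illustration only.
print(round(gflops_per_watt(16384, 16384, 0.02, 120.0), 2))
```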
[Figure 10 plots: energy efficiency [GFLOPS/W] against FFT
length [samples] (32 to 1M) in three panels, complex-to-complex
FP32, FP64 and FP16. Series: Jetson Nano, Tesla V100, Tesla P4,
Titan XP, Titan V (FP16 panel: Jetson Nano, Tesla V100 and
Titan V only). Markers distinguish radix n=2, radix n>2 and
Bluestein FFT lengths.]
Figure 10: Floating-point operations per second per Watt
(GFLOPS/W) for optimal frequency. The coloured region
shows the improvement from the default frequency.
The change in the execution time for the optimal fre-
quency with respect to the default execution time as a
percentage is shown in Fig. 11. The change in GFLOPS is
shown in Fig. 12. The peaks visible in Fig. 11 correspond
to FFT lengths which displayed case c) type behaviour of
the execution time (Fig. 6).
The increase in the energy efficiency (7) with respect
to the boost core clock frequency is shown for different
precisions in Fig. 13 and with respect to the base core
clock frequency in Fig. 14.
[Figure 11 plots: difference of execution time [%] against FFT
length [samples] (32 to 1M) in three panels, complex-to-complex
FP32, FP64 and FP16. Series: Tesla P4, Titan XP, Titan V,
Tesla V100 and Jetson Nano (FP16 panel: Titan V, Tesla V100 and
Jetson Nano only). Markers distinguish radix n=2, radix n>2
and Bluestein FFT lengths.]
Figure 11: Increase in the execution time for optimal
frequencies as a percentage of the default execution time t_d.
[Figure 12 plots: computational performance [GFLOPS]
(logarithmic axis) against FFT length [samples] (32 to 1M) in
three panels, complex-to-complex FP32, FP64 and FP16. Series:
Jetson Nano, Tesla V100, Tesla P4, Titan XP, Titan V (FP16
panel: Jetson Nano, Tesla V100 and Titan V only). Markers
distinguish radix n=2, radix n>2 and Bluestein FFT lengths.]
Figure 12: Floating-point operations per second
(GFLOPS) for optimal frequencies. The colored region
shows the change from the default frequency.
[Figure 13 plots: increase in energy efficiency (1.0 to 2.2)
against FFT length [samples] (32 to 1M) in three panels, FP32,
FP64 and FP16. Series: Titan V, Tesla V100, Jetson Nano,
Tesla P4, Titan XP.]
Figure 13: The increase in the energy efficiency for
optimal core clock frequencies with respect to the boost core
clock frequency for all tested FFT lengths. The two
peaks observed in the Jetson Nano data are due to the
use of the Bluestein algorithm.
[Figure 14 plots: increase in energy efficiency (1.0 to 1.5)
against FFT length [samples] (32 to 1M) in three panels, FP32,
FP64 and FP16. Series: Titan V, Tesla V100, Tesla P4, Titan XP.]
Figure 14: The increase in the energy efficiency for the
optimal core clock frequencies with respect to the base
core clock frequency for all tested FFT lengths. The
Jetson Nano is not included since there is no base core
clock frequency.
We see that the optimal frequency of different FFT
lengths as shown in Fig. 9 is roughly the same for a given
GPU and precision across all tested FFT lengths.
Furthermore, the optimal frequency is roughly the same across
all numerical precisions for a given GPU, with the
exception of the Tesla P4 GPU. Based on this we have calculated a
mean optimal frequency for a given GPU and precision by
averaging the optimal frequencies; this achieves similar
increases in energy efficiency for all measured FFT lengths.
The increase in energy efficiency using the mean optimal
frequency is shown in Fig. 15 for the boost frequency and
in Fig. 16 for the base frequency. The values of the mean
optimal frequencies are listed in Table 3.
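The averaging step can be sketched in a few lines of Python. The sketch assumes a plain arithmetic mean over the per-length optima; the values in the example are hypothetical, and any rounding to a clock frequency supported by the hardware is left out.

```python
from statistics import mean

def mean_optimal_frequency(optimal_mhz_by_length: dict) -> float:
    """Arithmetic mean of the per-FFT-length optimal core clock frequencies [MHz]."""
    return float(mean(optimal_mhz_by_length.values()))

# Hypothetical per-length optima for one GPU and precision, for illustration only.
optima = {1024: 930, 2048: 945, 4096: 960, 8192: 945}
print(mean_optimal_frequency(optima))  # 945.0
```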
When considering existing pipelines, it is also interest-
ing to study the relationship between the increase in en-
ergy efficiency and the increase in the execution time. This
relationship indicates the cost (in units of execution time)
of any increase in energy efficiency. This is shown for the
V100 GPU in Fig. 17 and for the Jetson Nano in Fig. 18.
Table 3: Mean optimal core clock frequencies.
Card name     FP32 [MHz]  FP64 [MHz]  FP16 [MHz]
Tesla V100    945         945         937
Tesla P4      746         1126        NA
Titan V       952         967         1042
Titan XP      1151        1215        NA
Jetson Nano   460.8       460.8       460.8
[Figure 15 plots: increase in energy efficiency (1.0 to 2.2)
against FFT length [samples] (32 to 1M) in three panels, FP32,
FP64 and FP16. Series: Titan V, Tesla V100, Tesla P4, Titan XP,
Jetson Nano.]
Figure 15: The increase in the energy efficiency for the
mean optimal frequency with respect to the boost core
clock frequency for all tested FFT lengths. The two
peaks observed in the Jetson Nano data are due to the
use of the Bluestein algorithm.
[Figure 16 plots: increase in energy efficiency (0.9 to 1.5)
against FFT length [samples] (32 to 1M) in three panels, FP32,
FP64 and FP16. Series: Titan V, Tesla V100, Tesla P4, Titan XP.]
Figure 16: The increase in the energy efficiency for the
mean optimal frequency with respect to the base core
clock frequency for all tested FFT lengths. The Jet-
son Nano is not included since there is no base core clock
frequency.
5.3 Integration into existing pipelines
To demonstrate the applicability of the mean optimal
frequency in existing pipelines we have employed part of the
data processing pipeline³ used for the detection of pulsars
in time-domain radio astronomy data. The pipeline uses
several computational steps: FFT; power spectrum
calculation; mean and standard deviation calculation; and the
harmonic sum. The harmonic sum adds the value of higher
harmonics of the pulsar in the power spectrum to the
pulsar's expected fundamental frequency, thus increasing the
signal-to-noise ratio of the pulsar in the power spectrum.
The harmonic sum can add up to 32 higher order
harmonics, which decreases the FFT execution footprint in
the pipeline's total execution time.
³ Source code for the pipeline used is on GitHub:
https://github.com/KAdamek/cuFFT_energy_efficiency_example
[Figure 17 heat map: increase in energy efficiency [%] (number
in each cell) against FFT length; the cell colour encodes the
increase in execution time [%] on a scale from 0.8 to 6.2+.
Cell values by FFT length:
32 64 128 256 512 1024 2048 4096 8192 16k 32k 64k 128k 256k 512k 1M 2M
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 9 5 11 10 10 8 7 14 7 9 9 5 7 7 9 9
18 25 19 22 28 25 22 18 18 18 20 22 17 20 20 19 20
29 35 30 35 32 37 35 31 35 30 31 33 30 34 32 31 33
37 44 40 41 43 43 44 44 43 42 43 44 42 47 43 41 44
54 58 51 54 53 58 60 54 53 53 56 52 56 61 59 57 58
52 58 51 55 55 55 55 53 57 51 58 58 56 59 56 57 59]
Figure 17: Trade-off between an increase in energy
efficiency in percent (represented by a number in each cell)
and an increase in execution time (represented by a color)
for the V100 GPU.
[Figure 18 heat map: increase in energy efficiency [%] (number
in each cell) against FFT length; the cell colour encodes the
increase in execution time [%] on a scale from 10 to 70+.
Cell values by FFT length:
32 64 128 256 512 1024 2048 4096 8192 16k 32k 64k 128k 256k 512k 1M 2M
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 12 10 11 10 11 11 11 11 10 10 11 11 10 11 11 11
21 23 23 24 24 25 24 25 24 23 24 24 21 23 24 25 25
35 37 36 37 37 37 36 35 36 35 35 35 33 35 36 38 38
49 54 53 54 53 51 52 48 50 49 49 50 49 50 52 53 53
84 70 71 70 67 67 65 60 63 62 62 63 64 64 66 68 69
100 84 90 87 82 81 77 71 76 76 74 75 77 77 80 83 84]
Figure 18: Trade-off between an increase in energy
efficiency in percent (represented by a number in each cell)
and an increase in execution time (represented by a color)
for the Jetson Nano.
To change the frequency during the pipeline execution
we have used the NVIDIA Management Library (NVML)
[28]. This approach, however, has limitations because
the library is fully supported only on scientific (Tesla)
NVIDIA GPUs. The measured power consumption and
the core clock frequency for the V100 GPU are shown in
Fig. 19 and the increase in energy efficiency for different
configurations of the pipeline is listed in Table 4.
The usage of the NVML library is simple. Before the
GPU kernel execution the core clock frequency is (for a
given GPU) set using nvmlDeviceSetGpuLockedClocks
providing maximum and minimum core clock fre-
quency. When the calculation is finished the GPU
core clock frequency is returned to default by calling
nvmlDeviceResetGpuLockedClocks.
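The set-then-reset pattern described above can be sketched in Python as a context manager. To keep the sketch runnable without a GPU, the two NVML calls are passed in as functions; here they are replaced by stand-ins that only record the calls. The argument order (device, minimum clock, maximum clock) follows the NVML C API, the name locked_core_clock is hypothetical, and 945 MHz is the V100 FP32 value from Table 3.

```python
from contextlib import contextmanager

@contextmanager
def locked_core_clock(handle, min_mhz, max_mhz, set_clocks, reset_clocks):
    """Lock the GPU core clock for the duration of a block, then restore the default.

    set_clocks and reset_clocks stand in for nvmlDeviceSetGpuLockedClocks and
    nvmlDeviceResetGpuLockedClocks.
    """
    set_clocks(handle, min_mhz, max_mhz)
    try:
        yield
    finally:
        reset_clocks(handle)  # always return to the default clock behaviour

# Stand-in functions that only record the calls (no GPU is touched here).
calls = []
fake_set = lambda h, lo, hi: calls.append(("set", h, lo, hi))
fake_reset = lambda h: calls.append(("reset", h))

with locked_core_clock("gpu0", 945, 945, fake_set, fake_reset):
    calls.append(("fft",))  # the cuFFT call would run here

print(calls)
```

Using a try/finally block means the clocks are reset even if the computation raises an error, which avoids leaving the GPU locked at a reduced frequency.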
The FFT length used for the computation was N =
5 · 10^5, which was not used in our measurements or in our
calculation of the mean optimal frequency.
5.4 Profiling
For profiling, we have used the NVIDIA visual profiler
(NVVP). Based on the different behavior of the execution
time t_fix shown in Fig. 4 we have selected three
representative power-of-two FFT lengths (N = 8192, N = 16k,
N = 2M) which are calculated by different kernels. The
profiling results for these kernels are shown in Fig. 20. For
our study of compute utilization we have used two
indicators. The first is the compute utilization as reported by
the NVVP; the second is the issue slot utilization,
which tells us how many instruction slots are used. The
next quantity displayed in Fig. 20 is the device memory
bandwidth utilization (device MBU). Fig. 20 also shows
the normalized execution time from fastest to slowest to
provide context for the other displayed quantities.
Table 4: Increase in energy efficiency for different
configurations of our toy data processing pipeline.
Number of harmonics summed   cuFFT % of total exec. time   Increase in energy efficiency
2                            60.85                         1.291
4                            58.56                         1.290
8                            55.92                         1.267
16                           53.73                         1.260
32                           51.34                         1.240
[Figure 19 plots: power consumption [W] (top, 0 to 300) and
core frequency [MHz] (bottom, 250 to 1500) against sample
index (0 to 3000) for the Tesla V100, without NVML and with
NVML.]
Figure 19: Measured power consumption (top) and core
clock frequency (bottom) for part of a radio astronomy
data processing pipeline.
6 Discussion
The dependency of the execution time on the core clock
frequency is shown in Fig. 6, which displays the three
previously discussed behaviours a), b) and c). However, the
Jetson Nano only exhibits the third type of behaviour c).
All other GPUs, represented by the V100 GPU, exhibit a
composition of all three behaviours with cases a) and b)
being dominant.
The behaviour in case b) might be due to reduced cache
contention, which slightly increases the hit rate of the
unified cache as shown by the NVVP. However, it might
also be a systematic error caused by measurement using
the NVIDIA driver, which is based on the GPU core clock
frequency. In this case as well as in case a) the GPU's
compute resources are not fully utilized and the computations
are limited by device memory bandwidth.
[Figure 20 plots: normalized execution time (fast to slow),
device memory bandwidth utilization, floating point operation
utilization and issue slot utilization (0 to 100) against core
clock frequency [MHz] in six panels: N=8192; N=65536 #1 and #2;
N=2M #1, #2 and #3.]
Figure 20: Profiling results for the V100 GPU using the NVIDIA visual profiler. Longer FFT lengths use more than
one GPU kernel to calculate the Fourier transform; these kernels are numbered.
The increase in the execution time at a
particular critical frequency is due to the saturation of the
number of issued instructions (see Fig. 20). This leads to a
reduction in memory requests to the device memory which,
in turn, leads to poor latency hiding of the device memory
accesses. Therefore most of the threads are waiting for
data but there are not enough threads with data to utilize
the floating-point operation units. Thus the floating-point
operation utilization remains mostly unchanged.
The sharp increase in the execution time t_fix at low
frequencies, which is present in all cases, is due to the
change of the P-state to a state corresponding to the idle
status of the GPU with reduced voltage, which severely
reduces the available GPU resources.
Lastly, case c) occurs due to the high utilization of one
of the caches. Since the cache bandwidth decreases with
the core clock frequency each decrease in frequency lowers
the bandwidth which is already fully utilised leading to a
decrease in performance.
The average power consumption shown in Fig. 8 tells us
why, even with longer execution times, we can improve
energy efficiency. The rate of decrease in power consumption
is higher than the rate at which the execution time
increases. This is especially visible around f = 1000 MHz
for the V100 GPU and about f = 450 MHz for the Jetson
Nano. These frequencies roughly coincide with the mean
optimal frequency for the given GPUs.
6.1 Real-time processing
The energy efficiency is shown in Fig. 10, the change in
the execution time is shown in Fig. 11 and the change in
GFLOPS is shown in Fig. 12.
In the language of costs, Fig. 11 is equivalent to the
increase in capital costs, as an increase in execution time
directly translates into more hardware needed in order to
meet the constraints of real-time data processing. On the
other hand, the increase in energy efficiency (Fig. 10) is
related to operational costs, where better energy efficiency
translates into lower operational costs. However, we must
bear in mind that operational costs include cooling,
facility management, etc., which could be increased by the
requirement for more hardware due to longer execution
times.
For FP32 precision we see that the Jetson Nano is more
energy efficient than the V100 GPU for almost all FFT
lengths, especially for the small FFT lengths where it is
50% more efficient. When we look at the change in the
execution time we see that the Jetson Nano requires
approximately 60% more time to finish compared to the execution
time at the boost core clock frequency, with one extreme
case where the execution time is 140% longer. This means
that, on average, 60% more hardware is needed to achieve
real-time data processing with the best energy efficiency.
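The link between a longer execution time and the amount of hardware can be made concrete with a small Python sketch. It assumes that the workload divides evenly across identical devices and that throughput scales inversely with execution time; the baseline of 10 devices is purely illustrative.

```python
import math

def devices_for_real_time(baseline_devices: int, time_increase_percent: float) -> int:
    """Devices needed to keep the same throughput when each run takes longer."""
    return math.ceil(baseline_devices * (100 + time_increase_percent) / 100)

print(devices_for_real_time(10, 60))  # 16
print(devices_for_real_time(10, 5))   # 11
```

A 60% increase in execution time turns 10 devices into 16, while a 5% increase needs one additional device.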
This behaviour is not reproduced by the V100 GPU
where the increase in energy efficiency is not, for the most
part, at the expense of the execution time. The change
in the execution time for the V100 GPU is below 5%.
There are more significant increases in execution time for
the non-power-of-two FFT lengths, which can cause
increases of up to 20% in execution time. The small changes in
the execution time on the V100 GPU offer a possibility
to improve existing real-time processing pipelines without
a substantial change in hardware.
We see similar behaviour for the V100 GPU at FP64
precision. The slow-down in execution time suffered by
the V100 GPU due to the lower core clock frequencies
is within 5%. The execution time for most of the
non-power-of-two FFT lengths does not increase by more than 20%.
The Tesla P4 GPU, Titan XP GPU and Jetson Nano do
not fully support FP64 precision. This manifests in less
significant improvements in GFLOPS/W, much higher ex-
ecution times and a decrease in GFLOPS. In the case of
the Jetson Nano we would have to double the number of
cards in order to process data in real-time.
At FP16 precision we have only three GPUs which sup-
port this precision: V100 GPU, Titan V GPU and Jet-
son Nano. Regarding energy efficiency, both the V100
GPU and the Jetson Nano are comparable but the V100
GPU is the overall more energy efficient GPU. When we
look at the change in execution time we see that the V100
GPU typically has a 10% increase or less, but at some
FFT lengths the increase is as high as 40% (N=64). This
behaviour means that we have to be more careful about
potential energy savings since at some FFT lengths the
increase in execution time might be too high for real-time
data processing. The change in execution time of the Jet-
son Nano is again large and we would need to have almost
twice the number of GPUs to process data in real time at
the best possible energy efficiency.
6.2 Increase in energy efficiency
The increase in the energy efficiency for the optimal fre-
quency is shown in Fig. 13 and Fig. 14. The correspond-
ing figures for the mean optimal frequency are Fig. 15 and
Fig. 16. The difference in the increase in energy efficiency
for the base core clock frequency between the optimal fre-
quency and the mean optimal frequency is 5 percentage
points. That is, an average increase in energy efficiency
for the optimal frequency which is tuned for each FFT
length is 29% whereas the average increase in energy ef-
ficiency for the mean optimal frequency is 24%. For the
V100 GPU this holds for all FFT lengths and precisions
with a very limited number of exceptions for FP16 pre-
cision. For the boost core clock frequency the loss is 10
percentage points. This allows us to use one core clock
frequency and achieve similar energy savings without de-
termining the optimal frequency for each FFT length. A
similar result is observed for the Jetson Nano with the
exception of Bluestein FFT lengths which are responsible
for the peaks in the results.
The dependency between the increase in energy effi-
ciency and the change in the execution time, shown in
Fig. 17 for the V100 GPU but more notably in Fig. 18
for the Jetson Nano, is non-linear. We see that we can
achieve an interesting increase in energy efficiency even
for increases in execution time which are below 10%.
Lastly, our practical test with our example data pro-
cessing pipeline shows that we can dynamically change
the core clock frequency in a very precise manner. Our
code demonstrates how to target only the duration of the
cuFFT library call within the pipeline and thus reduce
power consumption. This technique can be applied to
existing pipelines or, more generally, to any software with
minimal changes to the codebase. The increases in energy
efficiency (with respect to the boost core clock frequency)
summarized in Table 4 correspond to the expected values based
on the FFT execution time footprint within the pipeline. For the
first configuration with 2 harmonics, the FFT execution
time corresponds to 60% of the total execution time. The
average increase in energy efficiency for the V100 GPU with
respect to the boost core clock frequency (based on Fig. 15)
is about 50%. Considering the FFT execution time footprint
we should get a 30% increase in energy efficiency, which is
indeed what we have measured. This behaviour is consistent
with other configurations of the pipeline.
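The estimate used in this argument can be written as a short Python sketch. It is a first-order approximation that scales the FFT-only gain by the share of the execution time spent in cuFFT; the shares are taken from Table 4, the 50% gain is the approximate value quoted in the text, and the function name is hypothetical.

```python
def expected_pipeline_gain(fft_time_fraction: float, fft_gain: float) -> float:
    """First-order estimate: only the FFT share of the run time benefits."""
    return 1.0 + fft_time_fraction * fft_gain

# (number of harmonics summed, cuFFT share of the total execution time) from Table 4
configurations = [(2, 0.6085), (4, 0.5856), (8, 0.5592), (16, 0.5373), (32, 0.5134)]
for harmonics, share in configurations:
    print(harmonics, round(expected_pipeline_gain(share, 0.5), 3))
```

The estimates decrease with the number of harmonics summed, following the same trend as the measured values in Table 4.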
7 Conclusions
We have measured the power consumption when calcu-
lating the Fourier transformation at different numerical
precisions (FP32, FP64, FP16) on NVIDIA GPUs using
the NVIDIA cuFFT library and quantified the possible
energy savings when DVFS techniques are used. For each
tested GPU, precision, and a wide range of FFT lengths,
we have found the optimal core clock frequency that minimises
energy consumption. We have also measured the
change in execution time of the Fourier transform when
DVFS is applied, which is an important consideration for
real-time data processing because this can increase when
the core clock frequencies of the GPU are modified.
We have presented the achieved energy efficiency in
GFLOPS/W. Along with this we have presented the in-
crease in energy efficiency when using our optimal core
clock frequency compared to the boost and base core clock
frequency for each GPU. We have also presented the in-
crease in the execution time of the Fourier transform when
DVFS is applied.
The decrease in power consumption and the change in the
execution time depend on the GPU used. In the case of
the V100 GPU, the average increase in energy efficiency
for FP32, FP64, and FP16 precisions is 60% compared
to the boost core clock frequency. When compared to the
base core clock frequency an average increase in energy
efficiency of 30% for FP32 and FP64 precision and 20% for
FP16 precision is observed. The increase in the execution
time is below 5% (with few exceptions as outlined). The
Jetson Nano offers higher increases in energy efficiency
than the V100 GPU: on average 70% for FP32, 55% for
FP64 and 70% for FP16, but at the expense of execution
time, which increases by more than 60%. For the P4 GPU
and the Titan V GPU we have not achieved a significant
increase in energy efficiency.
Our results have shown that the Volta architecture is
significantly more energy efficient than the P4 GPU which
represents the most energy efficient GPU from the pre-
vious Pascal generation. When compared to the Jetson
Nano the V100 GPU is less energy efficient at FP32 pre-
cision. For short and long FFTs at FP32 precision the
Jetson Nano is 50% more energy efficient than the V100
GPU. For FP16 precision the V100 GPU has similar en-
ergy efficiency as the Jetson Nano. The Jetson Nano does
not fully support double precision thus the V100 GPU is
significantly more energy efficient at this precision.
We have shown that values of optimal core clock fre-
quencies for all tested FFT lengths for a given GPU and
numerical precision are similar, with few exceptions. This
allowed us to define a mean optimal core clock frequency
that is unique to each tested GPU and precision, but is the
same for all FFT lengths. Using the mean optimal core clock
frequency, we have achieved a similar energy efficiency when
compared to the energy efficiency achieved with the opti-
mal core clock frequency for each tested FFT length. For
the V100 GPU the difference is only 5 percentage points.
For the other GPUs the loss is similar.
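The selection of a single frequency for all FFT lengths can be sketched as follows. This is a minimal illustration which assumes that the mean optimal frequency is the arithmetic mean of the per-length optimal frequencies, rounded to the nearest frequency in the table; the efficiency table, the clock values, and the function names are hypothetical and are not data from this work.

```python
from statistics import mean

# Hypothetical table: efficiency[fft_length][core_clock_mhz] is the
# increase in energy efficiency (in %) relative to the boost clock.
efficiency = {
    1024:  {800: 52.0, 900: 60.0, 1000: 58.0, 1100: 50.0},
    4096:  {800: 55.0, 900: 59.0, 1000: 61.0, 1100: 54.0},
    16384: {800: 57.0, 900: 62.0, 1000: 60.0, 1100: 51.0},
}


def optimal_clock(curve):
    """Clock with the highest efficiency for one FFT length."""
    return max(curve, key=curve.get)


def mean_optimal_clock(table):
    """Mean of the per-length optimal clocks, snapped to the nearest
    clock present in the table (all lengths share the same clocks)."""
    clocks = sorted(next(iter(table.values())))
    target = mean(optimal_clock(curve) for curve in table.values())
    return min(clocks, key=lambda f: abs(f - target))


def loss_percentage_points(table, clock):
    """Efficiency lost per FFT length when using one shared clock."""
    return {n: max(curve.values()) - curve[clock]
            for n, curve in table.items()}


shared = mean_optimal_clock(efficiency)
loss = loss_percentage_points(efficiency, shared)
print(shared, loss)
```

The loss is expressed in percentage points, matching the way the difference between the mean optimal and the per-length optimal frequency is quoted above.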
We have also presented a practical implementation
of these results in our example data processing pipeline,
which is available as open source code. We have demonstrated
how to change the core clock frequency of the
GPU to the mean optimal core clock frequency using the
NVIDIA Management Library, and have shown a decrease
in power consumption that is in agreement with
the results presented in this work.
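A change of the core clock through the NVIDIA Management Library might look like the sketch below. It is not the code of the pipeline referred to above. It assumes the Python bindings of NVML (the pynvml module from the nvidia-ml-py package), a GPU that supports application clocks, and sufficient privileges to change them. The helper names nearest_supported_clock and set_core_clock are hypothetical.

```python
def nearest_supported_clock(target_mhz, supported_mhz):
    """Pick the supported core clock closest to the requested one."""
    if not supported_mhz:
        raise ValueError("no supported clocks reported")
    return min(supported_mhz, key=lambda f: abs(f - target_mhz))


def set_core_clock(target_mhz, device_index=0):
    """Set the application core clock of one GPU via NVML.

    Assumption: pynvml is installed and the device allows
    application clocks to be changed by the calling user.
    """
    import pynvml  # imported here so the helper above works without it

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # Keep the current memory clock; only the core clock changes.
        mem_mhz = pynvml.nvmlDeviceGetApplicationsClock(
            handle, pynvml.NVML_CLOCK_MEM)
        supported = pynvml.nvmlDeviceGetSupportedGraphicsClocks(
            handle, mem_mhz)
        core_mhz = nearest_supported_clock(target_mhz, supported)
        pynvml.nvmlDeviceSetApplicationsClocks(handle, mem_mhz, core_mhz)
        return core_mhz
    finally:
        pynvml.nvmlShutdown()
```

A call such as set_core_clock(945) would request the supported clock nearest to 945 MHz on device 0; the value 945 is only an example and is not a frequency recommended in this work.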
Finally, we have highlighted how, from an environmental
perspective, increasing the energy efficiency of the FFT
algorithm will be an important consideration for edge
computing and IoT.
Acknowledgment
This work has received support from STFC Grant
(ST/T000570/1). The authors acknowledge the
support of the OP VVV MEYS funded project
CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for
Informatics". The authors would like to acknowledge
the use of the University of Oxford Advanced Research
Computing (ARC) facility in carrying out this work
(http://dx.doi.org/10.5281/zenodo.22558). The au-
thors would like to express their gratitude to the Research
Centre for Theoretical Physics and Astrophysics, Institute
of Physics, Silesian University in Opava for institutional
support.
References
[1] Power consumption measurement with nvidia-smi,
March 2018.
[2] Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji In-
oue, Masato Edahiro, and Martin Peres. Power and
performance characterization and modeling of gpu-
accelerated systems. In Proceedings of the 2014 IEEE
28th International Parallel and Distributed Process-
ing Symposium, IPDPS ’14, page 113–122, USA,
2014. IEEE Computer Society.
[3] Karel Adámek and Wesley Armour. A GPU
Implementation of the Harmonic Sum Algorithm. In
Peter J. Teuben, Marc W. Pound, Brian A. Thomas,
and Elizabeth M. Warner, editors, Astronomical Data
Analysis Software and Systems XXVII, volume 523
of Astronomical Society of the Pacific Conference Se-
ries, page 489, October 2019.
[4] Karel Adámek, Sofia Dimoudi, Mike Giles, and
Wesley Armour. Improved Acceleration of the GPU
Fourier Domain Acceleration Search Algorithm. In
Pascal Ballester, Jorge Ibsen, Mauricio Solar, and
Keith Shortridge, editors, Astronomical Data Anal-
ysis Software and Systems XXVII, volume 522 of As-
tronomical Society of the Pacific Conference Series,
page 477, April 2020.
[5] Yehia Arafa, Ammar ElWazir, Abdelrahman ElKa-
nishy, Youssef Aly, Ayatelrahman Elsayed, Abdel-
Hameed Badawy, Gopinath Chennupati, Stephan
Eidenbenz, and Nandakishore Santhi. Verified
instruction-level energy consumption measurement
for nvidia gpus, 2020.
[6] L. Bluestein. A linear filtering approach to the com-
putation of discrete fourier transform. IEEE Transac-
tions on Audio and Electroacoustics, 18(4):451–455,
1970.
[7] Robert A. Bridges, Neena Imam, and Tiffany M.
Mintz. Understanding gpu power: A survey of profil-
ing, modeling, and simulation methods. ACM Com-
put. Surv., 49(3), September 2016.
[8] E. O. Brigham. The fast Fourier transform and its
applications. Prentice Hall, Englewood Cliffs, New
York, signal processing series edition, 1988.
[9] Martin Burtscher, Ivan Zecena, and Ziliang Zong.
Measuring gpu power with the k20 built-in sensor.
In Proceedings of Workshop on General Purpose
Processing Using GPUs, GPGPU-7, pages 28–36, New
York, NY, USA, 2014. Association for Computing
Machinery.
[10] Vincent Chau, Xiaowen Chu, Hai Liu, and Yiu-Wing
Leung. Energy efficient job scheduling with dvfs for
cpu-gpu heterogeneous systems. In Proceedings of
the Eighth International Conference on Future
Energy Systems, e-Energy ’17, pages 1–11, New York,
NY, USA, 2017. Association for Computing Machin-
ery.
[11] Tim Cornwell and Ben Humphreys. Ska exascale soft-
ware challenges, 2010.
[12] Sofia Dimoudi, Karel Adámek, Prabu Thiagaraj,
Scott M. Ransom, Aris Karastergiou, and Wesley Ar-
mour. A GPU Implementation of the Correlation
Technique for Real-time Fourier Domain Pulsar Ac-
celeration Searches. ApJS, 239(2):28, December 2018.
[13] Muhammad Fahad, Arsalan Shahid, Ravi Reddy
Manumachu, and Alexey Lastovetsky. A compara-
tive study of methods for measurement of energy of
computing. Energies, 12(11), 2019.
[14] J. S. Farnes, B. Mort, F. Dulwich, K. Adámek,
A. Brown, J. Novotný, S. Salvini, and W. Armour.
Building the world’s largest radio telescope: The
square kilometre array science data processor. In 2018
IEEE 14th International Conference on e-Science (e-
Science), pages 366–367, 2018.
[15] R. Ge, R. Vogt, J. Majumder, A. Alam, M. Burtscher,
and Z. Zong. Effects of dynamic voltage and fre-
quency scaling on a k20 gpu. In 2013 42nd Interna-
tional Conference on Parallel Processing, pages 826–
833, 2013.
[16] João Guerreiro, Aleksandar Ilic, Nuno Roma, and
Pedro Tomás. Dvfs-aware application classification to
improve gpgpus energy efficiency. Parallel
Computing, 83:93–117, 2019.
[17] James W. Cooley and John W. Tukey. An algorithm
for the machine calculation of complex fourier series.
Mathematics of Computation, 19(90):297–301, 1965.
[18] Yang Jiao, Heshan Lin, Pavan Balaji, and Wu-chun
Feng. Power and performance characterization of
computational kernels on the gpu. In Proceedings
of the 2010 IEEE/ACM Int’l Conference on Green
Computing and Communications & Int’l Conference
on Cyber, Physical and Social Computing, pages 221–
228. IEEE Computer Society, 2010.
[19] J. L. Jodra, I. Gurrutxaga, and J. Muguerza. A study
of memory consumption and execution performance
of the cufft library. In 2015 10th International Con-
ference on P2P, Parallel, Grid, Cloud and Internet
Computing (3PGCIC), pages 323–327, 2015.
[20] R. Jongerius, S. Wijnholds, R. Nijboer, and H. Cor-
poraal. An end-to-end computing model for the
square kilometre array. Computer, 47(9):48–54, 2014.
[21] J. Lee, V. Sathisha, M. Schulte, K. Compton,
and N. S. Kim. Improving throughput of power-
constrained gpus using dynamic voltage/frequency
and core scaling. In 2011 International Conference on
Parallel Architectures and Compilation Techniques,
pages 111–120, 2011.
[22] L. Levin, W. Armour, C. Baffa, E. Barr, S. Cooper,
R. Eatough, A. Ensor, E. Giani, A. Karastergiou,
R. Karuppusamy, et al. Pulsar searches with the
ska. Proceedings of the International Astronomical
Union, 13(S337):171–174, 2017.
[23] Dumitrel Loghin and Yong Meng Teo. The energy ef-
ficiency of modern multicore systems. In Proceedings
of the 47th International Conference on Parallel Pro-
cessing Companion, ICPP ’18, pages 1–10, New York,
NY, USA, 2018. Association for Computing Machin-
ery.
[24] Xinxin Mei, Qiang Wang, and Xiaowen Chu. A sur-
vey and measurement study of gpu dvfs on energy
conservation. Digital Communications and Networks,
3(2):89–100, 2017.
[25] E. Meneses, O. Sarood, and L. V. Kalé. Assessing
energy efficiency of fault tolerance protocols for hpc
systems. In 2012 IEEE 24th International Symposium on
Computer Architecture and High Performance
Computing, pages 35–42, 2012.
[26] Sparsh Mittal and Jeffrey S. Vetter. A survey of meth-
ods for analyzing and improving gpu energy efficiency.
ACM Comput. Surv., 47(2), August 2014.
[27] NVIDIA. Cufft library user’s guide, 2020 v11.0.3.
[28] NVIDIA. Nvidia management library, 2020 vR450.
[29] A. R. Offringa et al. WSCLEAN: an implementation
of a fast, generic wide-field imager for radio astron-
omy. MNRAS, 444(1):606–619, October 2014.
[30] D. C. Price, M. A. Clark, B. R. Barsdell, R. Babich,
and L. J. Greenhill. Optimizing performance-
per-watt on GPUs in high performance comput-
ing. Computer Science - Research and Development,
31(4):185–193, November 2016.
[31] David Reinsel, John Gantz, and John Rydning. The
digitization of the world from edge to core, 2018.
[32] J. W. Romein. A comparison of accelerator archi-
tectures for radio-astronomical signal-processing al-
gorithms. In 2016 45th International Conference on
Parallel Processing (ICPP), pages 484–489, 2016.
[33] A. Sethia and S. Mahlke. Equalizer: Dynamic tuning
of gpu resources for efficient execution. In 2014 47th
Annual IEEE/ACM International Symposium on Mi-
croarchitecture, pages 647–658, 2014.
[34] Peter Steinbach and Matthias Werner. gearshifft -
The FFT Benchmark Suite for Heterogeneous Plat-
forms. arXiv e-prints, page arXiv:1702.00629, Febru-
ary 2017.
[35] David Střelák and Jiří Filipovič. Performance
analysis and autotuning setup of the cufft library. In
Proceedings of the 2nd Workshop on AutotuniNg and
ADaptivity AppRoaches for Energy Efficient HPC
Systems, ANDARE ’18, pages 1–6, New York, NY,
USA, 2018. Association for Computing Machinery.
[36] Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xi-
aowen Chu. The impact of gpu dvfs on the energy and
performance of deep learning: An empirical study. In
Proceedings of the Tenth ACM International Confer-
ence on Future Energy Systems, e-Energy ’19, pages
315–325, New York, NY, USA, 2019. Association
for Computing Machinery.
[37] Anne E. Trefethen and Jeyarajan Thiyagalingam.
Energy-aware software: Challenges, opportunities
and strategies. Journal of Computational Science,
4(6):444–449, 2013. Scalable Algorithms for
Large-Scale Systems Workshop (ScalA2011),
Supercomputing 2011.
[38] Sebastiaan van der Tol, Bram Veenboer, and
André R. Offringa. Image Domain Gridding: a fast
method for convolutional resampling of visibilities.
Astronomy & Astrophysics, 616:A27, August 2018.
[39] B. Veenboer, M. Petschow, and J. W. Romein. Image-
domain gridding on graphics processors. In 2017
IEEE International Parallel and Distributed Process-
ing Symposium (IPDPS), pages 545–554, 2017.
[40] Bram Veenboer and John W. Romein. Radio-
astronomical imaging: Fpgas vs gpus. In Ramin
Yahyapour, editor, Euro-Par 2019: Parallel Process-
ing, pages 509–521, Cham, 2019. Springer Interna-
tional Publishing.
[41] Qiang Wang, Chengjian Liu, and Xiaowen Chu.
Gpgpu performance estimation for frequency scaling
using cross-benchmarking. In Proceedings of the 13th
Annual Workshop on General Purpose Processing
Using Graphics Processing Unit, GPGPU ’20, pages
31–40, New York, NY, USA, 2020. Association for
Computing Machinery.
