GPGPU Performance Estimation with Core and Memory Frequency Scaling by Wang, Qiang & Chu, Xiaowen
Qiang Wang
Department of Computer Science
Hong Kong Baptist University, Hong Kong
qiangwang@comp.hkbu.edu.hk
Xiaowen Chu
Department of Computer Science
Hong Kong Baptist University, Hong Kong
HKBU Institute of Research and Continuing Education,
Shenzhen, China
chxw@comp.hkbu.edu.hk
Abstract—Graphics Processing Units (GPUs) support dynamic
voltage and frequency scaling (DVFS) in order to balance
computational performance and energy consumption. However,
a simple and accurate method for estimating the performance
of a given GPU kernel under different frequency settings on
real hardware is still lacking, although such estimates are
needed to choose the best frequency configuration for energy
saving. This paper presents a fine-grained model that estimates
the execution time of GPU kernels under both core and memory
frequency scaling. Over a 2.5x range of both core and memory
frequencies and across 12 GPU kernels, our model achieves
accurate results (within 3.5%) on real hardware. Compared with
cycle-level simulators, our model needs only a few simple
micro-benchmarks to extract a set of hardware parameters,
together with performance counters of the kernels, to produce
this high accuracy.
Index Terms—Graphics Processing Units; Dynamic Voltage
and Frequency Scaling; GPU Performance Modeling;
I. INTRODUCTION
Graphics Processing Units (GPUs) are now widely used, from
Deep Learning (DL) workstations to high-performance
supercomputing centers. In particular, most popular DL
toolkits [1], [2], [3], [4] rely heavily on the remarkable
computation power of GPUs. However, due to the rapidly
increasing computation requirements of both DL toolkits and
other GPU applications on large amounts of data, the total
energy consumption can be very high, which not only results
in high electricity costs but also runs counter to the goals of
green computing. For example, the supercomputer Titan,
accelerated with NVIDIA K20x GPUs, requires a power supply of
8.21 million Watts with an electricity cost of about 23 million
dollars per year [5]. Even a 5% decrease in power consumption
can save up to 1 million dollars in electricity costs. Effective
energy-saving techniques for GPUs are therefore urgently needed.
Energy conservation techniques on modern computers are
generally based on Dynamic Voltage and Frequency Scaling
(DVFS). Current GPUs usually support simple automatic
voltage and frequency adjustment in order to save power
and protect the hardware. Nevertheless, GPUs rarely achieve
the best energy efficiency under the default voltage and
frequency settings [6], [7], [8], [9], so there is still room
for energy conservation. To find the most energy-efficient
DVFS configuration, the energy consumption under different
DVFS settings must be predicted, which requires modeling
both the performance and the runtime power of GPUs under
various voltage and frequency settings. This paper addresses
the performance modeling problem.
There are three main challenges in GPU performance
prediction under different core and memory frequencies. First,
compared to traditional CPUs, GPUs have a much more complex
memory hierarchy, of which GPU vendors reveal few details.
Second, GPUs have two independent frequencies, for the core
and the memory respectively, and they affect different
components of the GPU. Third, resource contention is heavy due
to the large number of concurrent threads.
Previous state-of-the-art work proposes analytical pipeline
GPU performance models [10], [11], [12], [13], [14], [15]
which emphasize the relationship between compute cycles
and memory latency. However, there is still room to
strengthen these models. First, the L2 cache has grown
larger with each GPU generation. Compared to Fermi (2011),
Maxwell (2014) has a four times larger L2 cache. A larger
cache generally increases the cache hit rate, which suggests
paying more attention to L2 cache latency and throughput
rather than only to DRAM. Second, most of the previous models
only work under the default frequency settings of the GPU.
The kernel behavior can change significantly when the core
and memory frequencies are adjusted.
Simulation methods [16], [17], [18] expose sufficient detail
to help understand the execution of GPU kernels. The
best available simulator to date [19] combines performance
counters and specific hardware information to estimate the
kernel execution time with high accuracy. However, GPU
generations evolve quickly, and such simulators still target
earlier GPU architectures like Fermi, so they do not reflect
the substantial changes in newer hardware. Besides, these
simulators usually take much longer to run than real
hardware, which makes them difficult to apply to real-time
power-performance optimization.
Recent GPU performance models [20], [21], [22], [23], [24]
also reflect the trend toward Machine Learning methods such as
K-means, multiple linear regression and neural networks, and
obtain considerable accuracy. However, few of them introduce
frequency scaling as an impact factor in their models. Besides,
they rely strongly on training data such as specific
performance counters and kernel settings. Even though they can
reveal some correlations between the input parameters and
execution time, how these parameters interact with each other
and contribute to the final time prediction needs further
exploration.
arXiv:1701.05308v2 [cs.PF] 13 Jun 2018
We believe that a fast and accurate GPU performance
model is a key ingredient for energy conservation with DVFS,
and that it should be applicable to real hardware. In this
paper, we first model the memory system of the GPU as an
FCFS (first-come, first-served) queue whose service rate
depends on the memory frequency. Based on that, we propose
a GPGPU performance estimation model that considers both
core and memory frequency scaling. This paper makes the
following contributions:
1) We model the memory system of the GPU with a simple
queue whose behavior depends on the frequency.
2) We establish an analytical GPU performance model with
both core and memory frequency scaling.
3) On real GPU hardware, our performance model
achieves 3.5% MAPE (Mean Absolute Percentage Error)
across 49 frequency settings with up to 2.5x scaling
among 12 kernels. Meanwhile, we achieve 0.7% to 6.9%
MAPE for each single kernel, which indicates high
accuracy and low variance of our performance model.
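The MAPE metric quoted in the contributions can be computed as in the following minimal sketch. It assumes the standard definition (mean of absolute relative errors, in percent); the function name `mape` and the sample execution times are hypothetical and are not taken from the paper's experiments.

```python
def mape(measured, predicted):
    """Mean absolute percentage error (in percent) of predictions."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative error of each prediction with respect to the measurement.
    errors = [abs(p - m) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical execution times in milliseconds (illustration only).
measured_ms = [10.0, 20.0, 40.0]
predicted_ms = [10.5, 19.0, 40.0]
print(f"MAPE = {mape(measured_ms, predicted_ms):.2f}%")
```

Averaging relative rather than absolute errors keeps short and long kernels on the same footing, which is presumably why a percentage metric is reported across kernels with very different run times.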
The rest of this paper is organized as follows. Section II
introduces some basic knowledge about GPU and DVFS tech-
niques followed by a motivating example about performance
scaling behaviors with different frequency settings. Section
III lists some related work. Section IV details our memory
queuing model for GPU with frequency scaling and based on
it, Section V proposes our GPGPU performance estimation
model with both core and memory frequency scaling. Section
VI describes our experimental setup and presents the experi-
mental results. Finally we state our conclusion and future work
in Section VII.
II. BACKGROUND AND MOTIVATION
A. GPU Architecture
Over the past five years NVIDIA has released five generations
of GPUs. The new functions and improvements in each
updated version are described in [25]. Despite some
differences in hardware configuration, such as core count and
memory size, the basic chip framework is almost the same.
Fig. 1 shows a brief block diagram of the Maxwell GTX980 GPU.
The structural details can be found in its official white paper.
Note that GPUs have a complicated memory hierarchy including
dynamic random-access memory (DRAM), L2 cache, shared memory,
texture/L1 cache and registers. Different memory types have
their own characteristics in terms of latency, bandwidth and
access scope, which makes it difficult to predict the execution
time of a GPU kernel.
B. Dynamic Voltage and Frequency Scaling
DVFS is one of the most typical techniques of energy
conservation for traditional CPUs. The dynamic power is
usually estimated by Equation (1) [26]. Since the total energy
consumption of one application is obtained by multiplying the
average runtime power and the execution time, performance
modeling plays an important role in energy consumption
prediction with different DVFS settings. For traditional CPUs,
scaling up the frequency is usually a good option to save
energy [27]. However, some previous GPU DVFS work indicates
that GPUs have more complex energy scaling behaviors when
adopting different voltages and frequencies and scaling up the
frequency does not often help reduce the energy consumption
[6], [7].
\[ P_{dynamic} = aCV^{2}f \tag{1} \]
[Figure 1: block diagram showing 16 stream processors, each with
an instruction cache, 4 warp schedulers, 4 instruction buffers,
registers, 128 cores, 32 SFU units, 32 LD/ST units, texture/L1
cache and shared memory, connected through the L2 cache and the
memory controller to DRAM.]
Fig. 1: The block diagram of NVIDIA GTX980 GPU board.
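As a small illustration of why execution time matters for energy prediction, the sketch below evaluates Equation (1) and multiplies power by execution time. All numeric values (activity factor, capacitance, voltage, run times) are hypothetical, the voltage is held constant for simplicity, and static power is ignored; none of this is measured data from the paper.

```python
def dynamic_power(a, c, v, f):
    """Equation (1): P_dynamic = a * C * V^2 * f (watts)."""
    return a * c * v ** 2 * f


def energy(avg_power_w, exec_time_s):
    """Energy (joules) = average runtime power * execution time."""
    return avg_power_w * exec_time_s


# Hypothetical chip parameters; only the frequency changes.
p_low = dynamic_power(a=0.5, c=1e-9, v=1.0, f=400e6)
p_high = dynamic_power(a=0.5, c=1e-9, v=1.0, f=1000e6)

# Assumed run times: a kernel whose time scales as 1/f versus a
# kernel whose time does not change with this frequency.
e_scaling_low = energy(p_low, 1.0)
e_scaling_high = energy(p_high, 0.4)
e_flat_low = energy(p_low, 1.0)
e_flat_high = energy(p_high, 1.0)

print(e_scaling_low, e_scaling_high)  # roughly equal energy
print(e_flat_low, e_flat_high)        # higher frequency costs more
```

Under these assumptions the frequency-insensitive kernel pays for the higher frequency without any time saving, which is the kind of behavior a performance model is needed to anticipate.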
Modern GPUs have two main frequency domains. One is the
core frequency, which mainly controls the speed of the streaming
multiprocessors (SMs), while the other is the memory frequency,
which affects the bandwidth of DRAM. Table I summarizes
the dominating frequency for different types of memory. Note
that only DRAM works under the memory frequency; the L2
cache works under the core frequency, though both serve
global memory requests.
TABLE I: DOMINATING FREQUENCY FOR DIFFERENT COMPONENTS.
Components        Dominating Frequency
DRAM              memory frequency
L2 Cache          core frequency
Shared Memory     core frequency
Texture Cache     core frequency
Register          core frequency
C. Performance Scaling Behaviors with Frequency
As different GPU applications may have various utilizations
of different hardware components, changing the frequencies
may lead to diverse performance scaling behaviors for them.
As a motivating example, we test a set of frequency pairs on
six GPU kernels to observe how the execution time changes.
We first fix the core frequency to 400 MHz and 1000
MHz respectively and scale the memory frequency from 400
MHz to 1000 MHz with a step size of 100 MHz. As illustrated
by Fig. 2(a) and 2(b), some kernels such as transpose (TR),
blackScholes (BS), vectorAdd (VA) and convolutionSeparable
(convS) achieve a speedup of about 2.5x when the memory
frequency is increased by 2.5x, while the two matrix
multiplication kernels, with global memory (MMG) and with
shared memory (MMS), show negligible speedup. Another
interesting finding is that MMG and MMS have different
scaling behaviors under different core frequencies: a higher
core frequency allows them to gain more speedup when the
memory frequency increases. A possible reason is that the
performance is limited by the core frequency when the core
frequency is low, and by the memory frequency when the core
frequency is high enough to drive the computational power.
[Figure 2: four panels plotting speedup (vertical axis, 1 to 3)
for the kernels BS, MMG, MMS, TR, VA and convS.
(a) Memory frequency/MHz on the horizontal axis, F_core = 400 MHz.
(b) Memory frequency/MHz on the horizontal axis, F_core = 1000 MHz.
(c) Core frequency/MHz on the horizontal axis, F_mem = 400 MHz.
(d) Core frequency/MHz on the horizontal axis, F_mem = 1000 MHz.]
Fig. 2: Performance scaling behavior under different frequency
settings. The upper two figures show the speedup of different
GPU kernels when increasing memory frequency with fixed
core frequency. The lower two figures show the speedup of
different GPU kernels when increasing core frequency with
fixed memory frequency.
Then we fix the memory frequency to 400 MHz and 1000
MHz respectively, and scale the core frequency from 400 MHz
to 1000 MHz. Fig. 2(c) and 2(d) show that core frequency
has little effect on the performance of TR, BS and VA but
a great impact on the other three. We also observe that the
performance can be limited by different frequency domains
under different frequency settings.
As a result, under different frequency settings, the
performance scaling behaviors can be diverse and complicated
among different GPU kernels. Our goal is to establish an
estimation model that can predict the execution time of a
given GPU kernel under different core and memory frequency
settings. To achieve this, we first explore how core
and memory frequencies affect different types of memory,
including DRAM, L2 cache and shared memory. That gives us
a quantitative model to estimate different memory latencies.
Then we use profiling tools to extract some performance
counters from running the kernel under the baseline frequency
settings. By combining the memory model with the profiling
data, the kernel execution time can be estimated under
other frequency settings.
III. RELATED WORK
To derive an accurate performance model of GPUs, it is
important to understand their complex memory hierarchy.
Henry Wong et al. [28] develop a micro-benchmark
suite and measure characteristics such as the cache structure
and latency of various memory types, TLB parameters, and the
latency and throughput of arithmetic and logic operations.
Meltzer [29] extended similar work to the Tesla C2070. In
addition, Xinxin Mei et al. [30], [31] conduct a similar
dissection but focus more on the memory hierarchy. They propose
a fine-grained P-chase method to explore the cache parameters
of the uncommon structures and replacement policies that appear
in the latest generations of GPUs (Kepler and Maxwell). However,
such methods usually test a single kernel with only a few
threads executing one type of instruction. When thousands of
threads access memory simultaneously, which generally happens in
GPU applications, the memory bandwidth might not satisfy the
demand and some operations are stalled in the queue
of the memory controller (MC). Such cases lead to high variance
in memory access latency.
Hong et al. [10], [11] proposed an analytical model, known as
the MWP-CWP model, that estimates the degrees of memory
parallelism and computation parallelism using some offline
information about the kernel program. Furthermore, Sim et al.
[14] improve the MWP-CWP model by considering cache effects,
SFU characteristics and instruction throughput. However, their
methods ignore the effects of shared memory latency and DRAM
memory latency divergence, which may introduce significant
bias in some memory-bound applications. Song et al. [22]
extend their models and pay more attention to different types
of memory access by collecting some simple counters. However,
the model averages the cache effects among all the warps
and potentially ignores memory latency divergence in some
asymmetric applications.
Nath et al. [12] present the CRISP model, which analyzes
performance under different compute core frequencies. They
pointed out that DVFS on GPUs differs from that on CPUs
since computation operations and memory transactions from
different threads can overlap most of the time. Based on the
characteristics of GPU performance under varying frequencies
found in experiments, they classify the different execution
stages of the kernel program and compute them with various
frequencies. However, CRISP only works for the case of either
scaling down the core frequency or scaling up the memory
frequency. Also, the model may become more complicated if
memory frequency is considered. Joao et al. [9] proposed a GPU
power estimation model with both core and memory frequency
scaling. They designed well-crafted microbenchmarks to extract
the model parameters of each GPU component under the default
frequency setting. Then they attempted to predict the power
consumption of an application over a wide range of frequency
scaling.
Recent state-of-the-art work shows the advantages of machine
learning methods for GPU performance and power modeling.
Gene Wu et al. [20] built a performance model based on
different patterns of scaling with various core and memory
frequencies. They first adopted K-means to cluster the
different patterns of scaling behavior among 37 kernels and
then explored the relationship between performance counters
and clustering patterns with ANN modeling. With the model
trained on a large amount of data, one can predict the
performance of a kernel under any setting of core and
memory frequency from the predicted scaling pattern. Wang
et al. [8] adopted the SVR algorithm to estimate GPU power
consumption considering both core and memory frequency
scaling.
IV. MEMORY MODELING WITH FREQUENCY SCALING
In previous performance modeling work, memory latency
is usually set as a constant parameter obtained by
micro-benchmarking. However, since the DRAM in a GPU can
be accessed by any thread running on any SM, the memory
latency seen by each thread may vary under intensive memory
transactions. Besides, memory latency can also change with
different frequency settings. In this section, we first model
the memory latency with a simple queueing model. Then
we measure the parameters used in the model under different
frequency settings for further performance modeling.
A. DRAM latency
When a warp issues a memory request to the global
memory, it usually takes hundreds of cycles to go
through DRAM if the data is not cached. This minimum
latency occurs when the memory system is idle and contains
only the overhead of path traversal and data access.
Fig. 3 shows this case. The inter-arrival interval between
two consecutive memory requests is longer than the time
needed to load the data from DRAM. Thus, each memory
request costs only the minimum latency. To compute the total
latency of finishing all the memory requests in this case, we
only need to consider how many memory requests are executed
by a warp. As can be inferred from Fig. 3, the total time
consumption T_lat of all the memory requests can be calculated
by Equation 2. interArr denotes the inter-arrival time of two
consecutive memory requests. #W denotes the total number of
warps. dm_lat denotes the minimum memory latency with
no memory contention. gld_trans denotes the number of global
memory transactions of one warp.
[Figure 3: timeline diagrams (time on the horizontal axis) for
Warp 1, Warp 2 and Warp 3; legend: Memory Latency, DRAM Load.]
Fig. 3: Execution time pipeline of infrequent DRAM requests.
The number on the block indicates the iterations of one warp.
\[ \mathit{T\_lat} = \mathit{interArr} \times \#W + \mathit{dm\_lat} \times \mathit{gld\_trans} \tag{2} \]
If the memory system is saturated by intensive memory
requests, the minimum latency can hardly be achieved. Most
requests have to wait in the queue until the previous ones
have finished. In Fig. 4, memory requests are issued
at very short intervals, so that each request incurs not
only the minimum latency but also a queueing delay,
i.e., the waiting time in the queue. Thus, intensive
memory access demands can lead to widely varying memory
latency. In this case, we can calculate the total time
consumption by Equation 3. dm_del denotes the service time of
one memory transaction.
[Figure 4: timeline diagram (time on the horizontal axis) for
Warp 1, Warp 2 and Warp 3; legend: Memory Latency, Memory Queue
Delay, DRAM Load.]
Fig. 4: Execution time pipeline of intensive DRAM requests.
The number on the block indicates the iterations of one warp.
\[ \mathit{T\_lat} = \mathit{dm\_lat} + \mathit{dm\_del} \times \mathit{gld\_trans} \times \#W \tag{3} \]
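Equations 2 and 3 can be written as two small functions, sketched below. The function and parameter names are illustrative choices; the baseline values 500 cycles (minimum latency) and 10.06 cycles (service time) are taken from the 400 MHz rows of Tables II and III, while the warp count, transaction count and inter-arrival time are hypothetical.

```python
def t_lat_idle(inter_arr, n_warps, dm_lat, gld_trans):
    """Equation 2: requests are sparse, so no request waits in the queue."""
    return inter_arr * n_warps + dm_lat * gld_trans


def t_lat_saturated(dm_lat, dm_del, gld_trans, n_warps):
    """Equation 3: the queue is saturated; every transaction adds dm_del."""
    return dm_lat + dm_del * gld_trans * n_warps


# Hypothetical workload: 512 warps, one global transaction per warp.
idle_cycles = t_lat_idle(inter_arr=600, n_warps=3, dm_lat=500, gld_trans=2)
busy_cycles = t_lat_saturated(dm_lat=500, dm_del=10.06, gld_trans=1,
                              n_warps=512)
print(idle_cycles)  # cycles for the sparse case
print(busy_cycles)  # cycles for the saturated case
```

In the saturated case the total grows linearly with the number of warps, which is the linear trend the queueing model predicts.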
We also observe this phenomenon in experiments. We revise
the global bandwidth benchmark code of [31] and add the clock
measuring function clock() to collect memory latency samples.
To minimize the overhead of clock() and of recording times
in global memory, we sample only one request
for some threads. Fig. 5 shows our experimental results, from
which we infer two things. First, the memory latency
can vary widely because of the intensive requests. Second, the
memory latency is approximately linearly correlated with the
warp number, which matches the model in Fig. 4. Next, we
explore how frequency scaling affects dm_lat and dm_del.
[Figure 5: (a) start time and end time stamps (cycles, ×10^8)
versus warp number (0 to 500); (b) memory latency (cycles,
0 to 9000) versus warp number (0 to 500).]
Fig. 5: Experimental results of memory access latency. The
time stamp data in 5(a) is sorted by starting time in ascending
order. The memory access latency of each warp in 5(b) is
sorted in ascending order.
We first use the global latency benchmark code of [31] to
measure dm_lat under different memory frequencies. For
simplicity, we only measure the latency in the case of a
TLB hit. Table II shows part of our results. The cycle
is the time unit under the GPU core frequency. We also measure
dm_lat with other frequency combinations and find that it can
be fitted by Equation 4 with an R-squared of 0.9959.
\[ \mathit{dm\_lat} = 222.78 \times \mathit{core\_f} / \mathit{mem\_f} + 277.32 \tag{4} \]
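The fit in Equation 4 can be checked numerically against the Cycles column of Table II, as sketched below. One assumption is made loudly here: the core frequency is held at 400 MHz while the memory frequency varies, because that is the setting under which Equation 4 reproduces the listed cycle counts; the function name `dm_lat_fit` is illustrative.

```python
def dm_lat_fit(core_f, mem_f):
    """Equation 4: minimum DRAM latency in core cycles."""
    return 222.78 * core_f / mem_f + 277.32


# Memory frequency (MHz) -> measured cycles, copied from Table II.
table2_cycles = {400: 500.0, 500: 455.5, 600: 425.8, 700: 404.6,
                 800: 388.7, 900: 376.3, 1000: 366.4}

ASSUMED_CORE_F = 400  # MHz; an assumption, see the lead-in above

for mem_f, cycles in table2_cycles.items():
    fitted = dm_lat_fit(ASSUMED_CORE_F, mem_f)
    print(f"mem_f={mem_f:4d} MHz  fitted={fitted:6.1f}  measured={cycles}")
```

The constant term can be read as the part of the latency spent in the core-clocked path, while the first term shrinks as the memory clock rises relative to the core clock.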
For the DRAM delay measurement, we again use the global
bandwidth benchmark code of [31] to achieve as high a DRAM
bandwidth as possible. Table III shows part of our results.
To calculate dm_del in cycles, we convert the total execution
time into cycles by multiplying the time by the core
frequency, and count the memory transactions of
all warps. Then we can infer dm_del from Equation 3 with
the obtained dm_lat. We observe that the DRAM delay is also
correlated with the bandwidth efficiency, i.e., the
percentage of DRAM bandwidth utilized. As the memory
frequency increases, dm_del in cycles becomes smaller and
the bandwidth efficiency becomes larger, which suggests that
a high memory frequency helps improve the utilization of DRAM
bandwidth.
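The inference of dm_del described above amounts to inverting Equation 3. A minimal sketch follows; the function name and all sample numbers are hypothetical, and the round trip simply confirms that the inversion recovers the service time that generated the synthetic run time.

```python
def infer_dm_del(exec_time_s, core_f_hz, dm_lat, gld_trans, n_warps):
    """Invert Equation 3 to recover the per-transaction service time.

    exec_time_s * core_f_hz converts the measured time into core cycles.
    """
    total_cycles = exec_time_s * core_f_hz
    return (total_cycles - dm_lat) / (gld_trans * n_warps)


# Synthetic round trip with hypothetical values.
core_f_hz = 400e6
dm_lat, true_dm_del = 500.0, 10.0
gld_trans, n_warps = 4, 1000
total_cycles = dm_lat + true_dm_del * gld_trans * n_warps  # Equation 3
exec_time_s = total_cycles / core_f_hz

recovered = infer_dm_del(exec_time_s, core_f_hz, dm_lat, gld_trans, n_warps)
print(recovered)
```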
TABLE II: MINIMUM DRAM LATENCY UNDER DIFFERENT MEMORY FREQUENCIES
Memory Freq./MHz    Core Freq./MHz    Cycles
400                 400               500
500                 500               455.5
600                 600               425.8
700                 700               404.6
800                 800               388.7
900                 900               376.3
1000                1000              366.4
TABLE III: DRAM READ DELAY UNDER DIFFERENT MEMORY FREQUENCIES
Memory Freq./MHz    Core Freq./MHz    Cycles    Bandwidth Efficiency
400                 400               10.06     76%
500                 500               9.76      78.13%
600                 600               9.54      79.8%
700                 700               9.31      81.83%
800                 800               9.19      83.42%
900                 900               9.06      84.51%
1000                1000              9.0       85%
B. L2 cache
Across GPU generations, the L2 cache has grown steadily
(e.g., from 512 KB on the Fermi GTX560Ti to 2 MB on the
Maxwell GTX980) in order to reduce the pressure on the
memory system. As mentioned before, different cache hit
rates may lead to different sensitivity to frequency
scaling. As in the DRAM measurement experiments,
we use the same global latency code to obtain the L2 cache
latency under different frequency settings. We observe that the
latency is always within 220 to 224 cycles. This is reasonable
because the L2 cache runs at the core frequency, so its latency
measured in core cycles does not change with scaling. Thus, we
take the average of 222 cycles as the L2 minimum latency.
For the same reason, we choose 1 cycle for l2_del, since the
L2 cache can return one memory request per core cycle.
C. Adjustment with frequency scaling
For simplicity, our performance model defines a baseline
frequency setting in which the ratio of memory frequency to core
frequency is one. We measure the basic latency and throughput
of all the components, including core computation, shared
memory access, constant memory, L2 cache and DRAM, under the
baseline setting. We can use the standard Average Memory
Access Time (AMAT) [32] model to obtain the average global
memory access latency agl_lat and the average queueing delay
agl_del of all the global memory transactions that happen during
kernel execution, using Equations 5a and 5b. l2_hr denotes the
L2 cache hit rate of the kernel. core_f and mem_f denote the
core and memory frequencies respectively.
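\[ \mathit{agl\_lat} = \mathit{l2\_lat} \times \mathit{l2\_hr} + \mathit{dm\_lat} \times (\mathit{core\_f} / \mathit{mem\_f}) \times (1 - \mathit{l2\_hr}) \tag{5a} \]
\[ \mathit{agl\_del} = \mathit{l2\_del} \times \mathit{l2\_hr} + \mathit{dm\_del} \times (\mathit{core\_f} / \mathit{mem\_f}) \times (1 - \mathit{l2\_hr}) \tag{5b} \]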
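Equations 5a and 5b translate directly into code, as in the sketch below. The L2 values (222 cycles latency, 1 cycle delay) and the DRAM baseline values (500 and 10.06 cycles) come from Section IV; the 50% hit rate and the function names are hypothetical.

```python
def avg_global_latency(l2_lat, dm_lat, l2_hr, core_f, mem_f):
    """Equation 5a: hit-rate-weighted global memory latency (core cycles)."""
    return l2_lat * l2_hr + dm_lat * (core_f / mem_f) * (1.0 - l2_hr)


def avg_global_delay(l2_del, dm_del, l2_hr, core_f, mem_f):
    """Equation 5b: hit-rate-weighted queueing delay (core cycles)."""
    return l2_del * l2_hr + dm_del * (core_f / mem_f) * (1.0 - l2_hr)


# Baseline setting (core_f == mem_f) with a hypothetical 50% L2 hit rate.
agl_lat = avg_global_latency(222, 500, 0.5, core_f=400, mem_f=400)
agl_del = avg_global_delay(1, 10.06, 0.5, core_f=400, mem_f=400)
print(agl_lat, agl_del)

# Raising only the core frequency inflates the DRAM terms in core cycles.
print(avg_global_latency(222, 500, 0.5, core_f=800, mem_f=400))
```

Only the DRAM terms carry the core_f/mem_f factor, which matches Table I: the L2 cache is clocked with the core, so its cost in core cycles stays fixed.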
Since we calculate the execution time in units of core
cycles, no extra adjustment is needed for the latency and
throughput of the components inside the SM.
V. GPU PERFORMANCE MODELING WITH FREQUENCY
SCALING
The SMs in GPUs execute threads in groups of 32 parallel
threads called warps [25]. A GPU generally launches a large
number of warps over the whole kernel execution period.
However, due to hardware resource limitations, one SM
can execute only a certain number of warps concurrently;
these are called active warps, and their number is denoted by
#Aw. Once we obtain the time consumption of one round of
active warps, denoted by T_active, the total execution time of
a kernel can be estimated by Equation 6. #B denotes the total
number of thread blocks; #Wpb denotes the number of warps per
block; #SM denotes the number of SMs.
\[ T_{exec} = T_{active} \times (\#Wpb \times \#B / (\#Aw \times \#SM)) \tag{6} \]
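Equation 6 scales the time of one round of active warps by the number of rounds needed to run all warps. A minimal sketch, with a hypothetical launch configuration and a hypothetical T_active value:

```python
def t_exec(t_active, warps_per_block, n_blocks, n_active_warps, n_sm):
    """Equation 6: total time = time per round * number of rounds.

    The number of rounds is the total warp count divided by the number
    of warps the whole GPU can keep active at once.
    """
    rounds = (warps_per_block * n_blocks) / (n_active_warps * n_sm)
    return t_active * rounds


# Hypothetical launch: 1024 blocks of 8 warps, 64 active warps per SM,
# 16 SMs, and 10000 cycles per round of active warps.
total_cycles = t_exec(t_active=10_000, warps_per_block=8, n_blocks=1024,
                      n_active_warps=64, n_sm=16)
print(total_cycles)
```

As written, the equation allows a fractional number of rounds; whether the last partial round should be rounded up is not stated in the text, so the sketch leaves it unrounded.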
Basically, a GPU kernel can be divided into several segments.
Some segments do not access shared memory, while others
do. Since they are influenced by different frequencies and
usually have different working patterns, we classify GPU
kernel segments into two categories according to whether
shared memory transactions happen during execution. Some
kernels also use the texture/L1 cache for further performance
improvement, which affects the accuracy of our model.
We leave this as future work.
A. Performance Modeling without Shared Memory
The first case of our model is that the GPU kernel does not
use shared memory. In this case, the kernel only contains
computation in the SMs and global memory transactions. As
described in Section IV, we can estimate the time consumption
of global memory transactions. For the computation part, we
simply assume that the same amount of computation time
precedes each global memory transaction, as shown in Fig. 6. We
divide the total number of compute instructions (denoted by
comp_inst) by the total number of global memory transactions
gld_trans to obtain the average number of compute instructions
(denoted by avr_inst). The average computation time (denoted
by avr_comp) before each global memory transaction can be
estimated by Equations (7a) and (7b).
\[ \mathit{avr\_inst} = \mathit{comp\_inst} / \mathit{gld\_trans} \tag{7a} \]
\[ \mathit{avr\_comp} = \mathit{inst\_cycle} \times \mathit{avr\_inst} \tag{7b} \]
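Equations (7a) and (7b) can be sketched as follows. The text does not define inst_cycle explicitly; it is read here, as an assumption, as the number of core cycles per compute instruction. The counter values are hypothetical.

```python
def avr_inst(comp_inst, gld_trans):
    """Equation (7a): compute instructions per global memory transaction."""
    if gld_trans <= 0:
        raise ValueError("gld_trans must be positive")
    return comp_inst / gld_trans


def avr_comp(inst_cycle, avg_inst):
    """Equation (7b): compute cycles preceding each global transaction."""
    return inst_cycle * avg_inst


# Hypothetical profile: 1200 compute instructions, 8 global transactions,
# and an assumed 4 cycles per instruction.
per_trans = avr_inst(comp_inst=1200, gld_trans=8)
comp_cycles = avr_comp(inst_cycle=4, avg_inst=per_trans)
print(per_trans, comp_cycles)
```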
If the kernel issues a large number of computation
instructions and only a few memory requests, which do not
saturate the memory bandwidth, the computation period dominates
the total kernel execution time. On the other hand, if the
memory access latency is much longer than the computation
cycles due to intensive memory requests, the computation
period can be hidden by memory operations. The first case
is regarded as a compute-dominated or compute-bound kernel,
while the second is memory-dominated or memory-bound.
[Figure 6: timeline diagram (time on the horizontal axis) for
Block 1 (3 warps) and Block 2 (3 warps); legend: Memory Latency,
Memory Queue Delay, DRAM Load, Computation (C).]
Fig. 6: Execution time pipeline of a compute-dominated kernel.
Since the kernel launches enough warps containing long
compute cycles, most of the memory latency can be hidden.
[Figure 7: timeline diagram (time on the horizontal axis) for
Block 1 (3 warps) and Block 2 (3 warps); legend: Memory Latency,
Memory Queue Delay, DRAM Load, Computation (C).]
Fig. 7: Execution time pipeline of a memory-intensive kernel.
Since each warp issues a few computation instructions but
frequent memory access requests, one memory transaction
cannot be processed until all outstanding transactions have
been finished.
1) Compute-Dominated: When there are enough computation
instructions to issue and the long computation periods keep
the memory requests from becoming too intensive, the global
memory latency can be hidden, as illustrated by Fig. 6. In
this case, Equations (8a) and (8b) are satisfied, and we can
estimate T_active by Equation (9). o_itrs denotes the number
of repetitions of one computation period followed by one
global memory transaction.
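\[ \mathit{avr\_comp} \geq \mathit{agl\_del} \tag{8a} \]
\[ \mathit{avr\_comp} \times (\#Aw - 1) \geq \mathit{agl\_lat} \tag{8b} \]
\[ T_{active} = \mathit{avr\_comp} \times \#Aw \times \mathit{o\_itrs} + \mathit{agl\_lat} \tag{9} \]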
2) Memory-Dominated: When the memory bandwidth is
saturated or there are not enough warps to issue computation
instructions, a memory request must wait until all the
outstanding requests have finished. Fig. 7 demonstrates
this case. The condition is described by Equations (10a) and
(10b). As in the case of Fig. 4, we can regard the compute
cycles as the inter-arrival time of two consecutive memory
requests. We can estimate T_active by Equation (11) by focusing
on the agl_del of each warp.
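\[ \mathit{avr\_comp} \leq \mathit{agl\_del} \tag{10a} \]
\[ (\mathit{avr\_comp} + \mathit{agl\_lat}) \geq (\mathit{agl\_del} \times (\#Aw - 1)) \tag{10b} \]
\[ T_{active} = \mathit{agl\_lat} + \mathit{avr\_comp} + \mathit{agl\_del} \times \#Wpb \times \mathit{o\_itrs} \tag{11} \]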
[Figure 8: timeline diagram (time on the horizontal axis) for
Warp 1, Warp 2 and Warp 3; legend: Memory Latency, Memory Queue
Delay, DRAM Load, Computation (C).]
Fig. 8: Execution time pipeline of a kernel containing few
warps that have short computation periods. Thus, the first
memory request of each warp has waiting period while the
rest do not.
When the kernel launches only a few warps, most of the latency
cannot be hidden, which leads to insufficient utilization of
the GPU. The memory latency may contribute a lot to the
execution time. There are two cases, distinguished by whether
avr_comp is shorter than agl_del. When avr_comp is shorter
than agl_del, the first memory request of each warp has
a waiting period, as Fig. 8 shows. This case is described by
Equations (12a) and (12b), and we can estimate T_active by
Equation (13).
[Figure 9: timeline diagram (time on the horizontal axis) for
Warp 1, Warp 2 and Warp 3; legend: Memory Latency, Memory Queue
Delay, DRAM Load, Computation (C).]
Fig. 9: Execution time pipeline of a kernel containing few
warps that have long computation periods. Thus, each memory
request does not need to wait and can be processed immediately.
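\[ \mathit{avr\_comp} \leq \mathit{agl\_del} \tag{12a} \]
\[ (\mathit{avr\_comp} + \mathit{agl\_lat}) \leq (\mathit{agl\_del} \times (\#Aw - 1)) \tag{12b} \]
\[ T_{active} = \mathit{agl\_del} \times \#Aw + \mathit{agl\_lat} + \mathit{avr\_comp} + (\mathit{avr\_comp} + \mathit{agl\_lat}) \times (\mathit{o\_itrs} - 1) \tag{13} \]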
When avr_comp is longer than agl_del, all the memory
requests can be processed immediately once issued, as Fig.
9 shows. It can be described as Equation (14a) and (14b). We
can estimate T_active by Equation (15).
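\[ \mathit{avr\_comp} \geq \mathit{agl\_del} \tag{14a} \]
\[ \mathit{avr\_comp} \times (\#Aw - 1) \leq \mathit{agl\_lat} \tag{14b} \]
\[ T_{active} = \mathit{avr\_comp} \times (\#Aw - 1) + (\mathit{avr\_comp} + \mathit{agl\_lat}) \times \mathit{o\_itrs} \tag{15} \]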
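The four cases of this subsection can be combined into one selector, sketched below. The conditions and formulas are transcribed as printed in Equations (8) to (15); the function name, the return convention (equation number plus value) and the tie-breaking rule for equalities are choices made for this sketch, and the sample inputs are hypothetical.

```python
def t_active_no_shared(avr_comp, agl_lat, agl_del, n_aw, n_wpb, o_itrs):
    """Pick the case by Eqs. (8), (10), (12), (14) and return
    (equation number used, estimated T_active in core cycles).

    Ties (avr_comp == agl_del, or equality in the second condition)
    fall into the '>=' branches; the text leaves this open.
    """
    if avr_comp >= agl_del:
        if avr_comp * (n_aw - 1) >= agl_lat:
            # Eqs. (8a), (8b): compute-dominated, Eq. (9).
            return 9, avr_comp * n_aw * o_itrs + agl_lat
        # Eqs. (14a), (14b): few warps, long compute periods, Eq. (15).
        return 15, avr_comp * (n_aw - 1) + (avr_comp + agl_lat) * o_itrs
    if avr_comp + agl_lat >= agl_del * (n_aw - 1):
        # Eqs. (10a), (10b): memory-dominated, Eq. (11).
        return 11, agl_lat + avr_comp + agl_del * n_wpb * o_itrs
    # Eqs. (12a), (12b): few warps, short compute periods, Eq. (13).
    return 13, (agl_del * n_aw + agl_lat + avr_comp
                + (avr_comp + agl_lat) * (o_itrs - 1))


# Hypothetical inputs exercising each branch.
print(t_active_no_shared(600, 361, 5.53, n_aw=64, n_wpb=8, o_itrs=10))
print(t_active_no_shared(10, 361, 5.53, n_aw=4, n_wpb=4, o_itrs=10))
print(t_active_no_shared(2, 361, 5.53, n_aw=64, n_wpb=8, o_itrs=10))
print(t_active_no_shared(2, 100, 5.53, n_aw=64, n_wpb=8, o_itrs=10))
```

Because core and memory frequency enter only through agl_lat and agl_del (Equations 5a and 5b), the same kernel can move between branches as the frequencies change.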
B. Performance Modeling with Shared Memory
Shared memory plays an important role in GPU performance
optimization since it has lower latency and higher
throughput than DRAM and can be shared among
the threads within the same block. Its latency and throughput
are affected by the core frequency. A GPU kernel may use
shared memory in different patterns, which makes
the performance estimation complicated. Basically, there are
three phases in this kind of kernel. First, all
the warps send memory requests to global memory and then
store the data into shared memory. Second, threads within the
same block access shared memory for computation. Finally,
the results are written back to global memory. According to the
shared memory access workload, we design performance
models for the following two cases.
[Figure 10: timeline diagram for Block 1 (4 warps) through
Block N (4 warps); legend: Memory Latency, DRAM Load/Store,
Computation, Shared Read/Write.]
Fig. 10: Execution time pipeline of a kernel with infrequent
shared memory access. Since there are very few shared
memory transactions, the first block finishes them quickly and
begins to send global memory requests to DRAM while the
final block has not even finished the first global memory
transaction.
1) Shared memory requests are infrequent: Some kernels
may have only a few iterations of shared memory requests. In
this case, since the latency of shared memory access is much
lower than that of global memory access, the shared memory
latency can often be hidden by global memory latency. Fig.
10 shows this case. In the first phase, all the warps
issue global memory requests and store the data into
shared memory, which results in heavy traffic in DRAM.
In the second phase, each warp executes only one shared
memory access, consuming only a small number of cycles.
Then each warp writes the results back to global memory,
which again issues a considerable number of global memory
transactions. The pattern is similar to Fig. 7 except that it
is the shared memory latency that is hidden. The condition for
this case is given by Equations (16a) and (16b), and we can
estimate the execution time by Equation (17). shm_lat denotes
the shared memory latency. Transpose with coalesced
optimization is one instance of this case. Since the shared
memory latency can be hidden, the kernel is sensitive to
memory frequency but not to core frequency, which also matches
the results of our earlier motivating example.
$$avr\_comp \le agl\_del \tag{16a}$$
$$(avr\_comp + shm\_lat) \le agl\_del \times (\#Aw - \#Wpb) \tag{16b}$$
$$T_{active} = avr\_comp + agl\_lat + agl\_del \times \#Aw \times gld\_trans \tag{17}$$
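As a minimal sketch, the check in Equations (16a) and (16b) and the estimate in Equation (17) can be written as two small functions. The class and function names are hypothetical, the fields simply mirror the symbols used in the text, and the numbers in the example are illustrative placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class KernelParams:
    """Model inputs in cycles or counts (hypothetical container)."""
    avr_comp: float   # average computation time of one period
    agl_lat: float    # average global memory latency
    agl_del: float    # average global memory delay
    shm_lat: float    # shared memory latency
    gld_trans: int    # global transactions of one warp in one iteration
    n_aw: int         # #Aw: warps running concurrently on one SM
    n_wpb: int        # #Wpb: warps per block


def shared_latency_hidden(p: KernelParams) -> bool:
    """Return True when conditions (16a) and (16b) both hold."""
    cond_a = p.avr_comp <= p.agl_del
    cond_b = (p.avr_comp + p.shm_lat) <= p.agl_del * (p.n_aw - p.n_wpb)
    return cond_a and cond_b


def t_active_infrequent(p: KernelParams) -> float:
    """Equation (17): cycles for one round of active warps."""
    return p.avr_comp + p.agl_lat + p.agl_del * p.n_aw * p.gld_trans


# Illustrative values only (assumed, not taken from the paper).
params = KernelParams(avr_comp=8, agl_lat=300, agl_del=10,
                      shm_lat=30, gld_trans=2, n_aw=32, n_wpb=4)
if shared_latency_hidden(params):
    print("T_active =", t_active_infrequent(params), "cycles")
```

Note that shm_lat appears only in the condition and not in Equation (17), which is how the sketch reflects the claim that the shared memory latency is hidden.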
[Figure 11 legend: Memory Latency, DRAM Load/Store, Computation,
Shared Read/Write. Phases: "Load data from DRAM to Shared Memory"
and "Load data from Shared Memory and Compute". Rows: Block 1,
Block 2, ..., Block N (4 warps each).]
Fig. 11: Execution time pipeline of matrix multiplication
with shared memory. Phase 1 contains a large number of
global memory requests. Phase 2 contains multiple shared
memory operations. Since Phase 2 is long enough to hide
the global memory latency of other blocks, the remaining
Phase 1 rounds only see memory contention within the same
block, owing to block synchronization.
2) Shared memory requests are intensive: If shared
memory access is intensive, its latency can contribute
significantly to the final execution time. Fig. 11 shows matrix
multiplication with shared memory as an example. At the
beginning, all the warps load data from global memory. Then,
in the second phase, each warp accesses shared memory
multiple times, which consumes considerable time. Since the total
shared memory latency of Phase 2 is long enough to hide the
global memory latency of other blocks, the global requests
within the same block have no contention with others. Because
of block synchronization, we can regard this procedure
as a repetition of these two big steps. Thus, the total execution
time of one round can be calculated by Eqs. (18)-(21). i_itrs
denotes the number of shared memory transactions within
Phase 2.
$$T_{phase1} = avr\_comp \times 2 + agl\_del \times gld\_trans \times \#Aw + agl\_lat + sh\_lat \tag{18}$$
$$T_{phase2} = avr\_comp \times (\#Wpb - 1) + (avr\_comp + sh\_lat) \times i\_itrs \tag{19}$$
$$T_{phase3} = avr\_comp \times 2 + agl\_del \times gld\_trans \times \#Wpb + agl\_lat + sh\_lat \tag{20}$$
$$T_{active} = T_{phase1} + (T_{phase2} + T_{phase3}) \times o\_itrs \tag{21}$$
Matrix multiplication with shared memory optimization is
one instance of this case. Its performance is sensitive to
both core and memory frequency, as revealed in our
previous motivating examples, and our model explains why.
First, Tphase1 and Tphase3 contain a large number
of global memory transactions, which makes the execution
time sensitive to memory frequency. Second, although shared
memory latency is much shorter than global memory latency,
the nearly three dozen shared memory requests in Tphase2 also
contribute substantially to the final Tactive, which makes the
execution time sensitive to core frequency as well. These two
cases cover most classical GPU kernels. Although
there are more complicated irregular instances such as
MC_EstimatePiInlineP and reduction, a similar phase-partitioning
methodology can be applied to them to some extent. We leave
the detailed analysis of these irregular kernels for future work.
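The three-phase estimate of Equations (18)-(21) can be sketched as a single function. This is only an illustration under stated assumptions: the function name and the returned dictionary are hypothetical, the warps-per-block term in Equation (19) is taken to be #Wpb, and the input numbers are placeholders rather than measured values.

```python
def t_active_intensive(avr_comp, agl_lat, agl_del, sh_lat,
                       gld_trans, n_aw, n_wpb, i_itrs, o_itrs):
    """Evaluate Eqs. (18)-(21); all times are in cycles."""
    # Eq. (18): first load phase, contention among all #Aw active warps.
    t_phase1 = (avr_comp * 2 + agl_del * gld_trans * n_aw
                + agl_lat + sh_lat)
    # Eq. (19): shared memory phase with i_itrs shared transactions.
    t_phase2 = avr_comp * (n_wpb - 1) + (avr_comp + sh_lat) * i_itrs
    # Eq. (20): later load phases, contention only within one block (#Wpb).
    t_phase3 = (avr_comp * 2 + agl_del * gld_trans * n_wpb
                + agl_lat + sh_lat)
    # Eq. (21): one initial load plus o_itrs repetitions of phases 2 and 3.
    t_active = t_phase1 + (t_phase2 + t_phase3) * o_itrs
    return {"phase1": t_phase1, "phase2": t_phase2,
            "phase3": t_phase3, "active": t_active}


# Illustrative values only (assumed, not taken from the paper).
result = t_active_intensive(avr_comp=8, agl_lat=300, agl_del=10, sh_lat=30,
                            gld_trans=2, n_aw=32, n_wpb=4,
                            i_itrs=32, o_itrs=10)
print(result)
```

With these placeholder inputs the shared memory phase dominates each round, which matches the qualitative argument that an intensive kernel is sensitive to core frequency as well as memory frequency.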
VI. EXPERIMENTS
A. Experimental Methodology
With the help of NVIDIA Inspector [33], we can fix the
performance state and adjust the core and memory
frequencies of the GPU within a certain range. In this
way we obtain execution times for specific frequency
combinations. We scale both core frequency
and memory frequency over a 2.5x range, from 400
MHz to 1000 MHz with a step size of 100 MHz, so that
49 frequency combinations are tested in total.
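The size of the test grid follows directly from the range and step size. A short sketch (variable names are hypothetical) that enumerates the (core, memory) frequency pairs:

```python
from itertools import product

# 400 MHz to 1000 MHz inclusive, in 100 MHz steps.
freqs_mhz = list(range(400, 1001, 100))

# Every (core_f, mem_f) pair tested in the experiments.
freq_pairs = list(product(freqs_mhz, freqs_mhz))

scaling_range = max(freqs_mhz) / min(freqs_mhz)
print(len(freqs_mhz), "settings per domain,", len(freq_pairs), "pairs,",
      f"{scaling_range}x range")
```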
We use NVIDIA Nsight tools [34] to extract the performance
counters needed to drive our model at the baseline
frequency of 700 MHz for both core and memory. Note that
this method requires collecting data only once,
which makes our model fast. We choose 700 MHz as the
baseline since it leaves room for both raising and lowering the
frequency, which suggests that our performance model is
general and flexible.
We validate our model on 12 realistic GPU kernels from
CUDA SDK 6.5, listed in Table VI, on a real NVIDIA Maxwell
GTX 980. Hardware specifications of our test machine are
listed in Table V. These benchmark applications cover a wide
range of execution patterns such as DRAM intensive, L2 cache
intensive, shared memory intensive and computation intensive.
We repeat our experiments 1000 times and report the
average results.
[Figure 12: stacked bars per kernel (BS, CG, FWT, MMG, MMS, MS, SC,
SN, SP, TR, VA, convSp). y-axis: "Normalization inst. proportion",
from 0 to 1. Legend: gld_trans, L2_trans, shm_trans, comp_inst.]
Fig. 12: Breakdown of different types of instructions
B. Experimental Results
First, we observe the instruction distributions
of different GPU kernels with the help of NVIDIA Nsight
tools. As Fig. 12 demonstrates, our tested kernels have widely
varying mixes of instruction types, which reflects
our aim of designing a general model for different types
of GPU kernels. In addition, such instruction statistics help
us locate the principal contributors to the execution time
under certain frequency settings. Combined with experimental
results, we can also infer some sources of under- or
over-estimation of execution time.
Fig. 13 shows the time prediction error under varying
memory frequency with fixed core frequency and vice versa.
Across the 49 available frequency settings and 12 kernels, every
prediction has an error below 16%, and 90% of
them are under 10%. For each kernel, the mean absolute
percentage error (MAPE) ranges from 0.7% to 6.9%, as shown
in Fig. 14. We achieve 3.5% MAPE across all the testing
samples. As mentioned before, some prediction errors can be
explained by the instruction distribution of a kernel. For example,
MatMul(S) has relatively larger under-estimation errors in
Fig. 13(a) than in Fig. 13(b). A possible reason is that the time
consumed in the SM, by shared memory access in particular, is
under-estimated. As for MatMul(G), although it launches a
great number of global memory transactions, it has a high
L2 cache hit rate of up to 97.5%, which makes it sensitive
to core frequency as well. Some kernels such as convSp, FWT
and SP have errors that decrease approximately linearly with larger
memory frequency in Fig. 13(a) and 13(b) but stable errors in
Fig. 13(c) and 13(d). This also reveals that kernels of this kind
are more sensitive to memory frequency, which is supported by
the fact that they have a high proportion of DRAM transactions.
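For reference, the error metrics discussed above can be computed as in the following sketch. The function names are hypothetical, the sign convention (negative means under-estimation) is an assumption since the text does not state it, and the sample times are made up for illustration.

```python
def signed_error_pct(measured, predicted):
    """Signed prediction error in percent.

    Assumed convention: negative values mean under-estimation.
    """
    return 100.0 * (predicted - measured) / measured


def mape(measured, predicted):
    """Mean absolute percentage error over paired samples, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / m for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Made-up execution times (ms) for three frequency settings of one kernel.
measured_ms = [10.0, 20.0, 40.0]
predicted_ms = [9.0, 22.0, 40.0]
print([signed_error_pct(m, p) for m, p in zip(measured_ms, predicted_ms)])
print(mape(measured_ms, predicted_ms))
```

A per-kernel MAPE as in Fig. 14 would apply `mape` to the 49 frequency pairs of one kernel, while the overall figure would apply it to all samples together.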
VII. CONCLUSION
In this work, we demonstrated a GPGPU performance
predictor for a wide range of both core and memory frequencies.
We first estimate the total time consumption of multiple
memory requests under different frequency settings. Then our
predictor takes the profiling data of a given kernel under our
baseline frequency setting as input to estimate its performance
at other core and memory frequencies. Our model can
predict the execution time of a GPU kernel on real hardware
quickly and accurately, which is important for deriving real-time
energy conservation suggestions with DVFS techniques.

TABLE IV: PARAMETERS USED FOR PERFORMANCE MODELING

| Parameter | Definition | How to obtain |
|---|---|---|
| agl_lat | Average latency of a global memory transaction considering L2 cache hit rate | Equation (5a) |
| agl_del | Average delay of a global memory transaction considering L2 cache hit rate | Equation (5a) |
| dm_lat | DRAM latency of one global transaction | Microbenchmarking |
| dm_del | DRAM delay of one global transaction | Microbenchmarking |
| interArr | Inter-arrival time between two consecutive memory requests | Microbenchmarking |
| l2_lat | L2 cache latency of one global transaction | Microbenchmarking |
| l2_del | L2 cache delay of one global transaction | Hardware specification |
| l2_hr | Hit rate at L2 cache for all transactions from SM | Nsight profiling |
| sh_lat | Latency of one shared memory transaction | Microbenchmarking |
| gld_trans | Number of global load/store transactions of one warp in one iteration | Nsight profiling |
| comp_inst | Total compute instructions of the kernel | Nsight profiling |
| avr_comp | Average computation time of one period | Equation (7b) |
| inst_cycle | Latency of each computation instruction | Hardware specification |
| #B | Total number of blocks | Kernel setup |
| #Wpb | Number of warps per block | Kernel setup |
| #W | Total number of warps of a kernel | Kernel setup |
| #Asm | Number of active SMs | Nsight profiling |
| #Aw | Number of warps run concurrently on one SM | Nsight profiling |
| o_itrs | Number of first-level iterations within a thread | Source code analysis |
| i_itrs | Number of second-level iterations within a thread | Source code analysis |
| Tlat | Total execution time of multiple global memory requests | Equations (2), (3) |
| Tactive | Cycles for executing one round of active warps on an SM | Equations (9), (11), (13), (15), (17), (21) |
| Texec | Total execution time of a kernel | Equation (6) |
| core_f | Frequency that controls the speed of the SM | Adjustments |
| mem_f | Frequency that controls the speed of DRAM | Adjustments |

TABLE V: TARGET GPU FREQUENCY CONFIGURATIONS

| Item | Value |
|---|---|
| Device | GTX 980 |
| Compute capability | 5.2 |
| SMs * cores per SM | 16 * 128 |
| Global memory bus width | 256-bit |
| Global memory size | 4 GB |
| Core frequency scaling | [400 MHz, 1000 MHz] |
| Memory frequency scaling | [400 MHz, 1000 MHz] |
| Scaling stride | 100 MHz |

TABLE VI: TESTED APPLICATIONS

| Abbr. | Application name |
|---|---|
| BS | BlackScholes |
| CG | conjugateGradient |
| FWT | fastWalshTransform |
| MMG | matrixMul(Global) |
| MMS | matrixMul(Shared) |
| SC | scan |
| SN | sortingNetworks |
| SP | scalarProd |
| TR | transpose |
| VA | vector addition |
| convSp | convolutionSeparable |
We show that our performance estimation method can
achieve 3.5% MAPE across a scaling range of up to 2.5x for
both core and memory frequency. Our experimental results also
indicate that our model captures very precisely the performance
scaling behavior not only of DRAM but also of the L2 cache
and shared memory.
As for future work, we see two directions for improvement.
First, our model does not explore shared memory in as
much depth as DRAM, and it does not take
texture/L1 cache and constant memory into account, which
may introduce larger errors for kernels that access
them. Second, combined with GPU power models,
it would be a promising project to build a real-time
voltage and frequency controller for GPUs based on energy
conservation strategies with DVFS techniques.
ACKNOWLEDGMENT
This work is supported by Shenzhen Basic Research Grant
SCI-2015-SZTIC-002.
REFERENCES
[1] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like
environment for machine learning,” in BigLearn, NIPS Workshop, no.
EPFL-CONF-192376, 2011.
[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale
machine learning on heterogeneous systems, 2015,” Software available
from tensorflow.org, vol. 1, 2015.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM international
conference on Multimedia, 2014, pp. 675–678.
[4] X. Huang, “Microsoft Computational Network Toolkit offers most
efficient distributed deep learning computational performance,”
https://goo.gl/9UUwVn, 2015, accessed: 2016-07-12.
[5] O. R. N. Laboratory, “Introducing Titan: advancing the era of acceler-
ating computing,” [Online] https://www.olcf.ornl.gov/titan/.
[6] X. Mei, Q. Wang, and X. Chu, “A Survey and Measurement
Study of GPU DVFS on Energy Conservation,” Accepted by
Digital Communication and Network (DCN). [Online]. Available:
https://arxiv.org/abs/1610.01784
[7] R. A. Bridges, N. Imam, and T. M. Mintz, “Understanding gpu power:
A survey of profiling, modeling, and simulation methods,” ACM Com-
puting Surveys (CSUR), vol. 49, no. 3, p. 41, 2016.
[8] Q. Wang and X. Chu, “GPGPU Power Estimation with Core
and Memory Frequency Scaling,” SIGMETRICS Perform. Eval.
Rev., vol. 45, no. 2, pp. 73–78, Oct. 2017. [Online]. Available:
http://doi.acm.org/10.1145/3152042.3152066
[9] J. Guerreiro, A. Ilic, N. Roma, and P. Tomas, “GPGPU Power Modeling
for Multi-domain Voltage-Frequency Scaling,” in 2018 IEEE Interna-
tional Symposium on High Performance Computer Architecture (HPCA),
Feb 2018, pp. 789–800.
[10] S. Hong and H. Kim, “An Analytical Model for a GPU Architecture with
Memory-level and Thread-level Parallelism Awareness,” in Proceedings
of the 36th Annual International Symposium on Computer Architecture,
ser. ISCA ’09. New York, NY, USA: ACM, 2009, pp. 152–163.
[Figure 13: four panels, in order: core_f = 400 MHz, core_f = 1000 MHz,
mem_f = 400 MHz, mem_f = 1000 MHz. x-axis: kernels BS, CG, FWT, MMG,
MMS, MS, SC, SN, SP, TR, VA, convSp. y-axis: Prediction Error (%),
from -30 to 30. Legend: the scaled frequency, 400 MHz to 1000 MHz in
100 MHz steps.]
Fig. 13: Time Prediction Error under different frequency settings. Each figure shows the results of scaling one of the frequencies
when the other is fixed.
[Figure 14: one bar per kernel (BS, CG, FWT, MMG, MMS, MS, SC, SN,
SP, TR, VA, convSp). y-axis: MAPE (%), from 0 to 15.]
Fig. 14: Mean absolute percentage error averaged across all
available frequency pairs
[11] ——, “An Integrated GPU Power and Performance Model,” in Pro-
ceedings of the 37th Annual International Symposium on Computer
Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp.
280–289.
[12] R. Nath and D. Tullsen, “The CRISP performance model for dynamic
voltage and frequency scaling in a GPGPU,” in Proceedings of the 48th
International Symposium on Microarchitecture. ACM, 2015, pp. 281–
293.
[13] R. Miftakhutdinov, E. Ebrahimi, and Y. N. Patt, “Predicting performance
impact of dvfs for realistic memory systems,” in Proceedings of the 2012
45th Annual IEEE/ACM International Symposium on Microarchitecture.
IEEE Computer Society, 2012, pp. 155–165.
[14] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc, “A performance analysis
framework for identifying potential benefits in GPGPU applications,” in
ACM SIGPLAN Notices, vol. 47, no. 8. ACM, 2012, pp. 11–22.
[15] X. Chen, Y. Wang, Y. Liang, Y. Xie, and H. Yang, “Run-time technique
for simultaneous aging and power optimization in gpgpus,” in Design
Automation Conference (DAC), 2014 51st ACM/EDAC/IEEE. IEEE,
2014, pp. 1–6.
[16] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.
Aamodt, and V. J. Reddi, “Gpuwattch: Enabling energy optimizations
in gpgpus,” in Proceedings of the 40th Annual International Symposium
on Computer Architecture, ser. ISCA ’13. New York, NY, USA: ACM,
2013, pp. 487–498.
[17] J. Lucas, S. Lal, M. Andersch, M. Alvarez-Mesa, and B. Juurlink, “How
a single chip causes massive power bills GPUSimPow: A GPGPU power
simulator,” in Performance Analysis of Systems and Software (ISPASS),
2013 IEEE International Symposium on. IEEE, 2013, pp. 97–106.
[18] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt,
“Analyzing CUDA workloads using a detailed GPU simulator,” in
Performance Analysis of Systems and Software, 2009. ISPASS 2009.
IEEE International Symposium on. IEEE, 2009, pp. 163–174.
[19] T. M. Aamodt, W. W. Fung, I. Singh, A. El-Shafiey, J. Kwa,
T. Hetherington, A. Gubran, A. Boktor, T. Rogers, A. Bakhoda et al.,
“GPGPU-Sim 3.x manual,” 2012.
[20] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou,
“GPGPU performance and power estimation using machine learning,”
in High Performance Computer Architecture (HPCA), 2015 IEEE 21st
International Symposium on. IEEE, 2015, pp. 564–576.
[21] Y. Abe, H. Sasaki, S. Kato, K. Inoue, M. Edahiro, and M. Peres, “Power
and performance characterization and modeling of gpu-accelerated sys-
tems,” in Parallel and Distributed Processing Symposium, 2014 IEEE
28th International. IEEE, 2014, pp. 113–122.
[22] S. Song, C. Su, B. Rountree, and K. W. Cameron, “A simplified
and accurate model of power-performance efficiency on emergent gpu
architectures,” in Parallel & Distributed Processing (IPDPS), 2013 IEEE
27th International Symposium on. IEEE, 2013, pp. 673–686.
[23] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka,
“Statistical power modeling of GPU kernels using performance coun-
ters,” in Green Computing Conference, 2010 International. IEEE, 2010,
pp. 115–122.
[24] T. T. Dao, J. Kim, S. Seo, B. Egger, and J. Lee, “A performance
model for gpus with caches,” Parallel and Distributed Systems, IEEE
Transactions on, vol. 26, no. 7, pp. 1800–1813, 2015.
[25] NVIDIA, “CUDA C Programming Guide,” [Online]
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[26] V. Kursun and E. G. Friedman, “Supply and Threshold Voltage Scaling
Techniques,” Multi-Voltage CMOS Circuit Design, pp. 45–84, 2006.
[27] D. H. Kim, C. Imes, and H. Hoffmann, “Racing and Pacing to Idle:
Theoretical and Empirical Analysis of Energy Optimization Heuristics,”
in Cyber-Physical Systems, Networks, and Applications (CPSNA), 2015
IEEE 3rd International Conference on. IEEE, 2015, pp. 78–85.
[28] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and
A. Moshovos, “Demystifying GPU microarchitecture through
microbenchmarking,” in Performance Analysis of Systems & Software
(ISPASS), 2010 IEEE International Symposium on. IEEE, 2010, pp.
235–246.
[29] R. Meltzer, C. Zeng, and C. Cecka, “Micro-benchmarking the C2070,”
in GPU Technology Conference. Citeseer, 2013.
[30] X. Mei, K. Zhao, C. Liu, and X. Chu, “Benchmarking the Memory
Hierarchy of Modern GPUs,” in Network and Parallel Computing,
2014, pp. 144–156.
[31] X. Mei and X. Chu, “Dissecting GPU Memory Hierarchy through
Microbenchmarking,” IEEE Transactions on Parallel and Distributed
Systems, doi:10.1109/TPDS.2016.2549523.
[32] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantita-
tive approach. Elsevier, 2011.
[33] Orbmu2k, “NVIDIA Inspector,” [Online]
http://blog.orbmu2k.de/tools/nvidia-inspector-tool.
[34] NVIDIA, “NVIDIA Nsight Visual Studio Edition,” [Online]
https://developer.nvidia.com/nvidia-nsight-visual-studio-edition.
