A Similarity Measure for GPU Kernel Subgraph Matching by Lim, Robert et al.
A Similarity Measure for GPU Kernel
Subgraph Matching
Robert Lim, Boyana Norris, and Allen Malony
University of Oregon, Eugene, OR, USA
{roblim1,norris,malony}@cs.uoregon.edu
Abstract. Accelerator architectures specialize in executing SIMD (single
instruction, multiple data) code in lockstep. Because the majority of CUDA
applications are parallelized loops, control flow information can provide
an in-depth characterization of a kernel. CUDAflow is a tool that statically
separates CUDA binaries into basic block regions and dynamically
measures instruction and basic block frequencies. CUDAflow captures this
information in a control flow graph (CFG) and performs subgraph matching
across the CFGs of various kernels to gain insights into an application’s
resource requirements, based on the shape and traversal of the graph,
instruction operations executed and registers allocated, among other
information. The utility of CUDAflow is demonstrated with SHOC and Rodinia
application case studies on a variety of GPU architectures, revealing
novel control flow characteristics that help end users, autotuners,
and compilers generate high-performing code.
1 Introduction
Structured programming consists of base constructs that represent how programs
are written [4,27]. When optimizing programs, compilers typically operate on the
intermediate representation (IR) of a control flow graph (CFG), which is derived
from program source code analysis and represents basic blocks of instructions
(nodes) and control flow paths (edges) in the graph. Thus, the overall program
structure is captured in the CFG and the IR abstracts machine-specific intrinsics
that the compiler ultimately translates to machine code. The IR/CFG allows the
compiler to reason more efficiently about optimization opportunities and apply
transformations. In particular, compilers can benefit from prior knowledge of
optimizations that may be effective for specific CFG structures.
In the case of accelerated architectures that are programmed for SIMD
parallelism, control divergence encountered by threads of execution presents a
major challenge for applications because it can severely reduce SIMD
computational efficiency. It stands to reason that by identifying the structural
patterns of a CFG from an accelerator (SIMD) program, insight into the branch
divergence problem [22] might be gained to help optimize such programs. Current
profiling approaches to understanding thread divergence behavior (e.g.,
[10,21,24]) do not map performance information to critical execution paths in
the CFG. While accelerator devices (e.g., GPUs) offer hardware performance
counters for measuring computational performance, it is more difficult to apply
them to capture divergence behavior [17].
arXiv:1707.02423v3 [cs.DC] 21 Mar 2019
Our research focuses on improving the detail and accuracy of control flow
graph information in accelerator (GPU) programs. We study the extent to which
CFG data can provide sufficient context for understanding a GPU kernel’s execu-
tion performance. Furthermore, we want to investigate how effective knowledge
of CFG shapes (patterns) could be in enabling optimizing compilers and au-
totuners to infer execution characteristics without having to resort to running
execution experiments. To this end, we present CUDAflow, a scalable toolkit for
heterogeneous computing applications. Specifically, CUDAflow provides a new
methodology for characterizing CUDA kernels using control flow graphs and in-
struction operations executed. It performs novel kernel subgraph matching to
gain insights into an application’s resource requirements. To the knowledge of
the authors, this work is a first attempt at employing subgraph matching for
revealing control flow behavior and generating efficient code.
Contributions described in this paper include the following.
– Systematic process to construct control flow graphs for GPU kernels.
– Techniques to perform subgraph matching on various kernel CFGs and GPUs.
– Approaches to reveal control flow behavior based on CFG properties.
The rest of the paper is organized as follows. Section 2 discusses prior work,
and Section 3 provides background information. Section 4 describes the
methodology behind our CUDAflow tool and our implementation approach. Section 5
describes the experimental setup, and Section 6 summarizes the findings of our
application characterization studies. Section 7 outlines future work.
2 Prior Work
Control flow divergence in heterogeneous computing applications is a well known
and difficult problem, due to the lockstep nature of the GPU execution paradigm.
Current efforts to address branch divergence in GPUs draw from several fields,
including profiling techniques in CPUs, and software and hardware architectural
support in GPUs. For instance, Sarkar demonstrated that the overall execution
time of a program can be estimated by deriving the variances of basic block
regions [23]. Control flow graphs for flow and context sensitive profiling were
discussed in [2,3], where instrumentation probes were inserted at selected edges
in the CFG, which reduced the overall profiling overhead with minimal loss of
information. Hammock graphs were constructed [30] that mapped unstructured
control flow on a GPU [11,28]. By creating thread frontiers to identify early
thread reconvergence opportunities, dynamic instruction counts were reduced
by as much as 633.2%.
Lynx [12] creates an internal representation of a program based on PTX and
then emulates it, which determines the memory, control flow and parallelism of
the application. This work closely resembles ours but differs in that we perform
workload characterization on actual hardware during execution. Other perfor-
mance measurement tools, such as HPCToolkit [1] and DynInst [20], provide
a way for users to construct control flow graphs from CUDA binaries, but do
not analyze the results further. The MIAMI toolkit [19] is an instrumentation
framework for studying an application’s dynamic instruction mix and control
flow but does not support GPUs.
Subgraph matching has been explored in a variety of contexts. For instance,
the DeltaCon framework matched arbitrary subgraphs based on similarity scores
[15], which exploited the properties of the graph (e.g., clique, cycle, star, bar-
bell) to support the graph matching. Similarly, frequent subgraph mining was
performed on molecular fragments for drug discovery [5], whereas document clus-
tering was formalized in a graph database context [14]. The IsoRank authors
consider the problem of matching protein-protein interaction networks between
distinct species [25]. The goal is to leverage knowledge about the proteins from
an extensively studied species, such as a mouse, which when combined with a
matching between mouse proteins and human proteins can be used to hypoth-
esize about possible functions of proteins in humans. However, none of these
approaches apply frequent subgraph matching for understanding performance
behavior of GPU applications.
[Figure 1 diagram labels: user, source.cu, nvcc, construct CFG, sample PC
counter, CUDAflow profiler, CUDAflow analysis, basic block counts, instruction
mixes, CFG matching.]
Fig. 1: Overview of our proposed CUDAflow methodology.
3 Background
Our CUDAflow approach, shown in Figure 1, works in association with the current
nvcc toolchain. Control flow graphs are constructed from static code analysis,
and program execution statistics are gathered dynamically through program
counter sampling. This measurement collects counts of executed instructions
and corresponding source code locations, among other information. In this way,
the CUDAflow methodology provides a more accurate characterization of the
application kernel than hardware performance counters alone, which lack the
ability to correlate performance with source line information and are prone to
miscounting events [16]. In particular, it gives a way to understand the control
flow behavior during execution.
[Figure 2 panels, each drawn for Kepler, Maxwell, and Pascal. Top: BFS kernel
warp, Reduction reduce, SPMV csr scalar. Bottom: Hotspot calc temp,
Particlefilter sum kernel, Pathfinder dynproc kernel.]
Fig. 2: Control flow graphs generated for each CUDA kernel, comparing
architecture families (Kepler, Maxwell, Pascal).
Kernel Control Flow Graphs One of the more complex ways to characterize
SIMD thread divergence is through a control flow graph (CFG) representation
of the computation. A CFG is constructed for each GPU kernel
computation in program order and can be represented as a directed acyclic graph
G = (N, E, s), where (N, E) is a finite directed graph, and a path exists from
the START node s ∈ N to every other node. A unique STOP node is also assumed
in the CFG. A node in the graph represents a basic block (a straight line of code
without jumps or jump targets), whereas directed edges represent jumps in the
control flow.
Each basic block region has a counter that is incremented each time the node
is visited. Upon sampling the program counter, the PC address is referenced
internally to determine which basic block region the instruction corresponds
to.
.L_41:
        /*04a0*/    DSETP.LE.AND P0, PT, |R6|, +INF, PT;
        /*04a8*/    @P0 BRA `(.L_43);
        /*04b0*/    LOP32I.OR R5, R7, 0x80000;
        /*04b8*/    MOV R4, R6;
        /*04c8*/    BRA `(.L_42);
The SASS assembly code above illustrates how a control flow graph is constructed.
Each basic block is labeled in the left margin (e.g., “.L_41”), with predication
and branch instructions representing edges that lead to corresponding block
regions (e.g., “.L_43” and “.L_42”). The PC offsets are listed in hexadecimal
inside the comment delimiters (/* */). In other words, “.L_41” represents a
node n_i, with “.L_43” and “.L_42” as its children.
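The mapping from a SASS listing to CFG edges can be sketched in a few lines of Python. This is an illustrative sketch rather than the CUDAflow implementation. The function name `cfg_edges`, the regular expressions, and the underscore spelling of the labels (`.L_41`) are assumptions made for the example. It treats each labeled region as a node, as the text does, and does not split a region after a conditional branch.

```python
import re

# Listing taken from the SASS excerpt in the text. The ".L_41" label spelling
# is an assumption about how the disassembler prints block labels.
SASS = """\
.L_41:
        /*04a0*/ DSETP.LE.AND P0, PT, |R6|, +INF, PT;
        /*04a8*/ @P0 BRA `(.L_43);
        /*04b0*/ LOP32I.OR R5, R7, 0x80000;
        /*04b8*/ MOV R4, R6;
        /*04c8*/ BRA `(.L_42);
"""

LABEL = re.compile(r"^\s*(\.L_\d+):")            # start of a labeled region
BRANCH = re.compile(r"\bBRA\s+`\((\.L_\d+)\)")   # branch to a labeled region


def cfg_edges(listing):
    """Map each labeled region to the labels its branch instructions target."""
    edges = {}
    current = None
    for line in listing.splitlines():
        label = LABEL.match(line)
        if label:
            current = label.group(1)
            edges.setdefault(current, [])
            continue
        branch = BRANCH.search(line)
        if branch and current is not None:
            edges[current].append(branch.group(1))
    return edges


edges = cfg_edges(SASS)
print(edges)
```

For the excerpt, the single node `.L_41` ends up with the two children named in the text.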
Example control flow graphs for selected SHOC (top) [9] and Rodinia (bottom)
[6] GPU benchmarks are displayed in Figure 2. For different GPU architecture
types, the nvcc compiler produces different code and possibly different
control flow, as seen in the CFGs from Figure 2 for the Kepler, Maxwell and
Pascal architectures. Section 5 discusses the differences in GPU architectures.
The CFG differences for each architecture are due in part to the architecture
layout of the GPU and its compute capability (NVIDIA virtual architecture).
Maxwell generally uses fewer nodes for its CFGs, as evident in kernel warp. Our
approach can expose these important architecture-specific effects on the CFGs.
Also, note that structural similarities exist among several CFGs, including csr
scalar and sum kernel. Part of the goal of this research is to predict the
required resources for the application by inferring performance through CFG
subgraph matching, with the subgraphs serving as building blocks for more nested
and complex GPU kernels. For this purpose, we introduce several metrics that
build on this CFG representation.
Transition probability Transition probabilities represent how frequently each
edge into a vertex is taken, that is, how often execution branches to each code
region. They describe the application in a way that is obscured in a flat
profile. A stochastic matrix could also help eliminate dead code, since states
with transition probabilities of 0 represent node regions that will never be
visited. Kernels employing structures like loops and control flow increase the
complexity of the analysis, and knowledge of the transition probabilities of
kernels could help during code generation.
A canonical adjacency matrix M represents a graph G such that every diagonal
entry of M is filled with the label of the corresponding node and every
off-diagonal entry is filled with the label of the corresponding edge, or zero
if no edge exists [29]. The adjacency matrix describes the transition from
$N_i$ to $N_j$. If the probability of moving from i to j in one time step is
$\Pr(j \mid i) = m_{i,j}$, the adjacency matrix has $m_{i,j}$ as its element in
the $i$-th row and $j$-th column. Since the total transition probability from a
state i to all other states must be 1, this matrix is a right stochastic matrix,
so that $\sum_j P_{i,j} = 1$, where $P_{i,j} = \Pr(j \mid i) = m_{i,j}$.
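As a minimal sketch of the right stochastic form, the Python below row-normalizes observed transition counts. The node names, the counts, and the function name `transition_matrix` are hypothetical and chosen only for illustration. A node with no outgoing transitions (such as STOP) is left as an all-zero row, so only rows with observed transitions sum to 1.

```python
def transition_matrix(transition_counts, nodes):
    """Row-normalize observed transition counts into transition probabilities."""
    index = {name: k for k, name in enumerate(nodes)}
    counts = [[0.0] * len(nodes) for _ in nodes]
    for (src, dst), seen in transition_counts.items():
        counts[index[src]][index[dst]] += seen
    matrix = []
    for row in counts:
        total = sum(row)
        # A node without observed outgoing transitions keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical sampled transitions: (source block, target block) -> count.
nodes = ["root", "L_1", "L_2", "STOP"]
observed = {
    ("root", "L_1"): 30,
    ("root", "L_2"): 10,
    ("L_1", "L_2"): 30,
    ("L_2", "STOP"): 40,
}
P = transition_matrix(observed, nodes)
for name, row in zip(nodes, P):
    print(name, row)
```

An entry of 0 in a row that does sum to 1 marks a transition that was never observed, which is the property the dead-code remark above relies on.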
Figure 3 illustrates transition probability matrices for a kernel from the
Pathfinder application (Fig. 2, bottom right), comparing Kepler (left) and
Maxwell (right) versions. Note that the Pascal version was the same as Maxwell,
as evident in Fig. 2, lower right, and was left out intentionally. The entries
of the transition probability matrix were calculated by normalizing over the
total number of observations for each observed node transition i to j. Although
the matrices differ in size, observe that a majority of the transitions take
place in the upper-left triangle, with a few transitions in the bottom-right,
for all matrices. The task is to match graphs of arbitrary sizes based on their
transition probability matrices and instruction operations executed, among
other information.
Kepler (left):
  R1   L1   L4   L3   L2   L5
 .21    −    −    −    −    −
   0  .04    −    −    −    −
   0  .04  .38    −    −    −
   0    0    0  .08    −    −
   0    0    0    0  .21    −
   0    0    0    0  .02    0
Maxwell (right):
  R1   L3   L2   L1
 .30    −    −    −
   0  .51    −    −
   0    0    0    −
   0    0    0  .21
Fig. 3: Transition probability matrices for the Pathfinder (dynproc kernel)
application, comparing Kepler (left) and Maxwell (right) versions.
Hybrid Static and Dynamic Analysis We statically collect instruction mixes
and source code locations from generated code and map the instruction mixes to
the source locator activity as the program is being run [17]. The static analysis of
CUDA binaries produces an objdump file, which provides assembly information,
including instructions, program counter offsets, and source line information. The
CFG structure is stored in iGraph format [8]. We attribute the static analysis
from the objdump file to the profiles collected from the source code activity
to provide a runtime characterization of the kernel as it is being executed on
the architecture. This mapping of static and dynamic profiles provides a rich
understanding of the behavior of the kernel application with respect to the
underlying architecture.
4 Methodology
Based on the kernel CFG and transition probability analysis, the core of the
CUDAflow methodology focuses on the problem of subgraph matching. In order
to perform subgraph matching, we first scale the matrices to the same size.
For graphs $G_1$ and $G_2$, the matrix of the smaller graph $G_i$, with
$|V_i| = \min(|V_1|, |V_2|)$, is scaled by spline interpolation to the maximal
proper submatrix $B(G_i)$ of size $\max(|V_1|, |V_2|)$. The similarities in the
shapes of the control flow graphs, the variants generated for each GPU (Table 2)
and the activity regions in the transition probability matrices (Fig. 3)
provided motivation for this approach. In our case, the dense hotspots in the
transition matrix should align with their counterparts if the matrices are
similar enough.
4.1 Bilinear Interpolation
To scale the transition matrix before performing the pairwise comparison, we
employ a spline interpolation procedure. Spline interpolation is a general form
of linear interpolation for functions of n-order polynomials, such as bilinear
and cubic. For instance, a first-order spline in two variables performs bilinear
interpolation on a rectilinear 2D grid (e.g., x and y) [13]. The idea is to
perform linear interpolation in both the vertical and horizontal directions.
Interpolation works by using known data to estimate values at unknown points.
Refer to [13] for the derivation of bilinear interpolation.
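The scaling step can be illustrated with a small pure-Python bilinear resampler. This is a sketch under stated assumptions: the paper does not say which library or grid alignment it uses, so the function name `bilinear_resize` and the choice to map the corners of the source matrix onto the corners of the target matrix are illustrative only.

```python
def bilinear_resize(matrix, size):
    """Resample a square matrix to size x size using bilinear interpolation."""
    n = len(matrix)
    if n < 1 or size < 1:
        raise ValueError("matrix and target size must be non-empty")
    # Map target index k in [0, size - 1] onto a source coordinate in [0, n - 1].
    scale = (n - 1) / (size - 1) if size > 1 else 0.0
    out = []
    for r in range(size):
        y = r * scale
        y0 = int(y)
        y1 = min(y0 + 1, n - 1)
        fy = y - y0
        row = []
        for c in range(size):
            x = c * scale
            x0 = int(x)
            x1 = min(x0 + 1, n - 1)
            fx = x - x0
            # Interpolate horizontally on the two bracketing rows, then vertically.
            top = (1 - fx) * matrix[y0][x0] + fx * matrix[y0][x1]
            bottom = (1 - fx) * matrix[y1][x0] + fx * matrix[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out


small = [[0.0, 1.0],
         [1.0, 0.0]]
resized = bilinear_resize(small, 3)
for row in resized:
    print(row)
```

The original entries are preserved at the corners and the new entries are weighted averages of their neighbors, which is why interpolation quality drops as the size gap between the two graphs grows.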
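Table 1: Distance measures considered in this paper.
Abbrev  Name       Result
Euc     Euclidean  $\sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$
Iso     IsoRank    $(I - \alpha Q \times P)x$
Man     Manhattan  $\sum_{i=1}^{n} |x_i - y_i|$
Min     Minkowski  $\sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$
Jac     Jaccard    $\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}$
Cos     Cosine     $1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$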
4.2 Pairwise Comparison
Once the matrix is interpolated, the affinity scores ($S_1$ and $S_2$ for
graphs $G'_1$ and $G'_2$, respectively) are matched via a distance measure,
which includes the Euclidean distance, the IsoRank solution [25], Manhattan
distance, Minkowski metric, Jaccard similarity, and Cosine similarity. The
distance measures considered in this work are listed in Table 1. By definition,
$sim(G_i, G_j) = 0$ when $i = j$, with the similarity measure assigning
progressively higher scores to objects that are further apart.
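The closed-form measures in Table 1 can be written directly in Python. The sketch below assumes that each (interpolated) matrix has been flattened into a vector of equal length; the function names are illustrative. IsoRank is omitted because it requires solving an eigenvalue-style system rather than evaluating a formula.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard(x, y):
    # Jaccard distance as written in Table 1 (real-valued vectors).
    num = sum((a - b) ** 2 for a, b in zip(x, y))
    den = (sum(a * a for a in x) + sum(b * b for b in y)
           - sum(a * b for a, b in zip(x, y)))
    return num / den if den else 0.0


def cosine(x, y):
    # Cosine distance; not defined when either vector is all zeros.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def flatten(matrix):
    """Turn a matrix (list of rows) into a single vector."""
    return [value for row in matrix for value in row]


x = flatten([[0.0, 3.0]])
y = flatten([[4.0, 0.0]])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 2),
      jaccard(x, y), cosine(x, y))
```

Each function returns 0 when a vector is compared with itself, which matches the convention that sim(G_i, G_j) = 0 when i = j.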
5 Experimental Setup
To demonstrate our CUDAflow methodology, we measured the performance of
applications on several GPU architectures.
5.1 Execution environment
The graphics processing units used in our experiments are listed in Table 2.
The selected GPUs reflect the various architecture family generations, and each
performance result presented in this paper is representative of GPUs belonging
to the same family. For instance, we observed that the performance results from
a K80 architecture and a K40 (both Kepler) were similar and, as a result, did
not include comparisons of GPU architectures within families. Also, note the
changes in architectural features across generations (global memory, MP, CUDA
cores per MP), as well as ones that remain fixed (constant memory, warp size,
registers per block). For instance, while the number of multiprocessors (MP, or
streaming multiprocessors, SM) increased in successive generations, the number
of CUDA cores per MP actually decreased. Even so, the total number of CUDA
cores (MP × CUDA cores per MP) increased in successive GPU generations.
Table 2: Graphics processors used in this experiment.
                            K80      M40         P100
CUDA capability             3.5      5.2         6.0
Global memory (MB)          11520    12288       16276
Multiprocessors (MP)        13       24          56
CUDA cores per MP           192      128         64
CUDA cores                  2496     3072        3584
GPU clock rate (MHz)        824      1140        405
Memory clock rate (MHz)     2505     5000        715
L2 cache size (MB)          1.572    3.146       4.194
Constant memory (bytes)     65536    65536       65536
Shared mem blk (bytes)      49152    49152       49152
Registers per block         65536    65536       65536
Warp size                   32       32          32
Max threads per MP          2048     2048        2048
Max threads per block       1024     1024        1024
CPU (Intel)                 Haswell  Ivy Bridge  Haswell
Architecture family         Kepler   Maxwell     Pascal
5.2 Applications
The Rodinia and SHOC application suites are collections of GPU applications
that cover a wide range of computational patterns typically seen in parallel
computing. Table 3 describes the applications used in this experiment along
with source code statistics, including the number of kernel functions, the
number of associated files and the total lines of code.
Rodinia Rodinia is a benchmark suite for heterogeneous computing which in-
cludes applications and kernels that target multi-core CPU and GPU platforms
[6]. Rodinia covers a wide range of parallel communication patterns, synchroniza-
tion techniques, and power consumption, and has led to architectural insights
such as memory-bandwidth limitations and the consequent importance of data
layout.
SHOC Benchmark Suite The Scalable HeterOgeneous Computing (SHOC)
application suite is a collection of benchmark programs testing the performance
and stability of systems using computing devices with non-traditional archi-
tectures for general purpose computing [9]. SHOC provides implementations for
CUDA, OpenCL, and Intel MIC, and supports both sequential and MPI-parallel
execution.
Table 3: Description of SHOC (top) and Rodinia (bottom) benchmarks studied.
Suite    Name              Ker  File  Ln    Description
SHOC     FFT               9    4     970   Forward and reverse 1D fast Fourier transform.
SHOC     MD                2    2     717   Compute the Lennard-Jones potential from molecular dynamics.
SHOC     MD5Hash           1    1     720   Compute many small MD5 digests, heavily dependent on bitwise operations.
SHOC     Reduction         2    5     785   Reduction operation on an array of single or double precision floating point values.
SHOC     Scan              6    6     1035  Scan (parallel prefix sum) on an array of single or double precision floating point values.
SHOC     SPMV              8    2     830   Sparse matrix-vector multiplication.
SHOC     Stencil2D         2    12    1487  A 9-point stencil operation applied to a 2D dataset.
Rodinia  Backprop          2    7     945   Trains weights of connecting nodes on a layered neural network.
Rodinia  BFS               2    3     971   Breadth-first search, a common graph traversal.
Rodinia  Gaussian          2    1     1564  Gaussian elimination for a system of linear equations.
Rodinia  Heartwall         1    4     6017  Tracks changing shape of walls of a mouse heart over a sequence of ultrasound images.
Rodinia  Hotspot           1    1     1199  Estimate processor temperature based on floor plan and simulated power measurements.
Rodinia  Nearest Neighbor  1    2     385   Finds k-nearest neighbors from unstructured data set using Euclidean distance.
Rodinia  Needleman-Wunsch  2    3     1878  Global optimization method for DNA sequence alignment.
Rodinia  Particle Filter   4    2     7211  Estimate location of target object given noisy measurements in a Bayesian framework.
Rodinia  Pathfinder        1    1     707   Scan (parallel prefix sum) on an array of single or double precision floating point values.
Rodinia  SRAD v1           6    12    3691  Diffusion method for ultrasonic and radar imaging applications based on PDEs.
Rodinia  SRAD v2           2    3     2021  ...
6 Analysis
To illustrate our new methodology, we analyzed the SHOC and Rodinia appli-
cations at different granularities.
6.1 Application level
Figure 4 plots goodness as a function of efficiency, which displays the
similarities and differences of the benchmark applications. The size of each
bubble represents the number of operations executed, whereas the shade
represents the GPU type. Efficiency describes how gainfully employed the GPU
floating-point units remained, or FLOPs per second:
$$\text{efficiency} = \frac{op_{fp} + op_{int} + op_{simd} + op_{conv}}{time_{exec}} \cdot calls_n \qquad (1)$$
The goodness metric describes the arithmetic intensity of the floating-point
and memory operations:
$$\text{goodness} = \sum_{j \in J} op_j \cdot calls_n \qquad (2)$$
Note that efficiency is measured via runtime, whereas goodness is measured
statically. Figure 4 (left) shows a positive correlation between the two measures,
where the efficiency of an application increases along with its goodness. Static
metrics, such as goodness, can be used to derive dynamic behavior of an applica-
tion. This figure also demonstrates that merely counting the number of executed
operations is not sufficient to characterize applications because operation counts
do not fully reveal control flow, which is a source of bottlenecks in large-scale
programs.
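Equations 1 and 2 translate into two short Python functions. This is only a sketch: the operation counts, the execution time, and the call count below are made-up numbers, and the choice of the class set J as floating-point plus memory operations is an assumption based on the description of goodness, since the text does not enumerate J.

```python
def efficiency(ops, time_exec, calls):
    """Eq. 1: executed operations per unit of time, scaled by the call count."""
    executed = ops["fp"] + ops["int"] + ops["simd"] + ops["conv"]
    return executed / time_exec * calls


def goodness(ops, calls, classes):
    """Eq. 2: sum of op_j * calls over the operation classes j in J."""
    return sum(ops[j] * calls for j in classes)


# Hypothetical per-kernel operation counts (not taken from the paper).
ops = {"fp": 600, "int": 300, "simd": 50, "conv": 50, "mem": 200}

eff = efficiency(ops, time_exec=2.0, calls=3)
good = goodness(ops, calls=3, classes=("fp", "mem"))
print(eff, good)
```

Only `efficiency` needs a measured execution time; `goodness` depends on counts that are available without running the kernel, which is the static versus dynamic distinction drawn above.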
[Figure 4. Left panel: "Goodness as a Function of Efficiency" (x-axis:
Efficiency, 0.0 to 0.8; y-axis: Goodness, 0.0 to 3.0; legend: Kepler, Maxwell,
Pascal). Right panel: "Euclidean Distance over Node Difference" (x-axis:
|G_1| - |G_2|, 0 to 40; y-axis: Euclidean Measure, 1.00 to 3.00).]
Fig. 4: Left: The static goodness metric (Eq. 2) is positively correlated with the
dynamic efficiency metric (Eq. 1). The color represents the architecture and
the size of bubbles represents the number of operations. Right: Differences in
vertices between two graphs, as a function of Euclidean metric for all GPU
kernel combinations. Color represents intensity.
6.2 CFG subgraph matching
Distribution of Matched Pairs Figure 4 (right) projects the distribution of
differences in vertices |V | for all 162 CFG kernel pairs (Table 3, 2nd col. + 3
GPUs) as a function of the Euclidean measure (application, architecture, ker-
nel), with shade representing the frequency of the score. Note that most matched
CFGs had a similarity score of 1.5 to 2.2 and had size differences under 10 ver-
tices. Figure 4 (right) also shows that as the differences in vertices increase,
similarity matching becomes degraded due to the loss of quality when interpo-
lating missing information, which is expected. Another observation is that strong
similarity results when node differences of the matched kernel pairs were at a
minimum, between 0 and 8 nodes.
Error Rates from Instruction Mixes Here, we wanted to see how far off
our instruction mix estimations were from our matched subgraphs. Figure 5
displays instruction mix estimation error rates, calculated using mean squared
error, for MD, Backprop, and SPMV kernels as a function of matched kernels
(x-axis) with IsoRank scores between 1.00 and 1.30. The naming convention for
each kernel is as follows: 〈gpu arch.suite.app.kernel〉. In general, CUDAflow is
able to provide subgraph matching for arbitrary kernels through the IsoRank
score, in addition to instruction mixes within an 8% margin of error. Note that
since relative dynamic performance is being estimated from static information,
relatively high error rates are to be expected.
[Figure 5 panels: "Error Rates for MD Kernel", "Error Rates for Backprop
Kernel", "Error Rates for SPMV Kernel". x-axis: matched kernels; y-axes: Mean
Absolute Error (%) and IsoRank; series: FLOPS, MemOps, CtrlOps.]
Fig. 5: Error rates when estimating instruction mixes statically from runtime
observations for selected matched kernels (x-axis), with IsoRank scores near
1.30.
[Figure 6: "Similarity Measures for Kernels and Architectures". Heat maps for
k80, m40, and p100 under the Euclidean (euc), IsoRank (iso), and Cosine (cos)
measures, with a color scale from 0.0 to 1.0. Kernels on both axes:
R.bfs.Z6Ker.1, R.hot.Z14ca.2, R.par.Z10su.3, R.pat.Z14dy.4, R.sra.Z6red.5,
R.sra.Z4sra.6, R.sra.Z11sr.7, R.sra.Z11sr.8, S.sca.Z11bo.9, S.spm.Z22sp.10,
S.spm.Z22sp.11, S.ste.Z13St.12.]
Fig. 6: Similarity measures for Euclidean, IsoRank and Cosine distances for 12
arbitrarily selected kernels.
[Figure 7: "Similarity Measures for Kernels and Architectures". Heat maps for
k80, m40, and p100 under the Jaccard (jac), Minkowski (min), and Manhattan
(man) measures, with a color scale from 0.0 to 1.0. Kernels on both axes:
R.bfs.Z6Ker.1, R.hot.Z14ca.2, R.par.Z10su.3, R.pat.Z14dy.4, R.sra.Z6red.5,
R.sra.Z4sra.6, R.sra.Z11sr.7, R.sra.Z11sr.8, S.sca.Z11bo.9, S.spm.Z22sp.10,
S.spm.Z22sp.11, S.ste.Z13St.12.]
Fig. 7: Similarity measures for Jaccard, Minkowski and Manhattan distances for
12 arbitrarily selected kernels.
Pairwise Matching of Kernels Figure 6 shows pairwise comparisons for 12
arbitrarily selected kernels, comparing Euclidean (top), IsoRank (middle), and
Cosine distance (bottom) matching strategies, and GPU architectures (rows).
Figure 7 shows comparisons for the Jaccard measure, Minkowski, and Manhattan
distances for the same 12 kernels. Note that the distance scores were scaled to
lie between 0 and 1, where 0 indicates strong similarity and 1 denotes weak
similarity. In general, all similarity measures, with the exception of IsoRank,
are able to match a kernel against itself, as evident in the dark diagonal
entries in the plots. However, this demonstrates that using similarity measures
in isolation is not sufficient for performing subgraph matching for CUDA
kernels.
[Figure 8 panels: "Dendrogram of Kernels (M40)" and "Dendrogram of Kernels
(P100)". x-axis: Kernels; y-axis: distance (Ward), 0 to 80 for M40 and 0 to 70
for P100.]
Fig. 8: Dendrogram of clusters for 26 kernels, comparing Maxwell (left) and
Pascal (right) GPUs.
Clustering of Kernels. We wanted to identify classes of kernels based on
characteristics such as instruction mixes, graph structures, and distance measures.
The Ward variance minimization algorithm minimizes the total within-cluster
variance by merging, at each step, the pair of clusters that leads to the minimum
increase in the weighted squared distance. The initial cluster distances in Ward's
minimum variance method are defined as the squared Euclidean distance between
points: $d_{ij} = d(\{X_i\}, \{X_j\}) = \|X_i - X_j\|^2$. Figure 8 shows a
dendrogram of clusters for 26 kernels calculated with Ward's method, all matched
against the Rodinia Particlefilter sum kernel, comparing the Maxwell (left) and
Pascal (right) GPUs, which both have 4 edges and 2 vertices in their CFGs. The sum
kernel performs a scan operation and is slightly memory intensive (∼26% on GPUs).
As shown, our tool is able to categorize kernels by grouping features, such as
instruction mixes, graph structures, and distance measures, that show strong
similarity. This figure also demonstrates that different clusters can be formed on
different GPUs for the same kernel; the hardware architecture may result in
different clusters of kernel classes.
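The merging procedure described above can be illustrated with a minimal pure-Python sketch. It is not the CUDAflow implementation: the function names and the four two-dimensional points are made up for illustration, and the cluster distances are updated with the standard Lance-Williams recurrence for Ward's criterion, starting from the squared Euclidean distances defined above.

```python
from itertools import combinations


def sq_euclidean(x, y):
    """Squared Euclidean distance ||x - y||^2 between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def ward_linkage(points):
    """Agglomerative clustering with Ward's minimum variance criterion.

    Returns the merges in order as (members_a, members_b, distance), where
    the members are tuples of indices into `points`.
    """
    # Active clusters: cluster id -> tuple of original point indices.
    clusters = {i: (i,) for i in range(len(points))}
    # Initial cluster distances: squared Euclidean distance between points.
    dist = {frozenset((i, j)): sq_euclidean(points[i], points[j])
            for i, j in combinations(clusters, 2)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Choose the pair whose merge increases within-cluster variance least.
        pair = min((frozenset(p) for p in combinations(clusters, 2)),
                   key=dist.__getitem__)
        i, j = sorted(pair)
        d_ij = dist[pair]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], d_ij))
        # Lance-Williams update: distance from every other cluster k
        # to the newly formed cluster (i union j).
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            dist[frozenset((k, next_id))] = (
                (n_i + n_k) * dist[frozenset((k, i))]
                + (n_j + n_k) * dist[frozenset((k, j))]
                - n_k * d_ij
            ) / (n_i + n_j + n_k)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical feature vectors (not measured kernel data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for members_a, members_b, d in ward_linkage(points):
    print(members_a, members_b, round(d, 3))
```

The merge distances returned by such a routine are what a dendrogram plots on its vertical axis; library implementations may report a differently scaled value (for example, a square root of this quantity), so absolute heights are not directly comparable across tools.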
Finally, we wanted to see if our technique could identify the same kernels
running on a different GPU. Figure 9 shows distance measures when comparing
three kernels across three GPUs, for a total of 9 comparisons, whereas Figure 10
shows pairwise comparisons for the same three kernels across 3 GPUs, for a total
of 27 comparisons (x-axis), considering pairwise comparisons in both directions
(e.g., sim(G1, G2) and sim(G2, G1)). Figure 9 displays patches of dark regions
in the distance measures that correspond to the same kernel compared across
different GPUs. As shown in Figure 10, our tool is able to group not only the
same kernel executed on different GPUs, as is evident in the three general
categories of clusters, but also kernels that exhibited similar characteristics
when running on a particular architecture, such as instructions executed, graph
structures, and distance measures.
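As a rough illustration of how such cross-GPU comparisons can be tabulated, the sketch below computes a pairwise distance matrix in both directions and rescales it to [0, 1], so that 0 indicates strong similarity. It uses textbook definitions of the Minkowski family and cosine distance; the labels, the feature vectors, and the max-based normalization are assumptions for illustration and are not taken from CUDAflow.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


def cosine_distance(x, y):
    """1 - cos(angle between x and y), clamped at 0 against rounding error."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return max(0.0, 1.0 - dot / norm)


def distance_matrix(vectors, measure):
    """Evaluate measure(a, b) for every ordered pair, scaled to [0, 1]."""
    names = list(vectors)
    raw = [[measure(vectors[a], vectors[b]) for b in names] for a in names]
    top = max(max(row) for row in raw) or 1.0  # assumed normalization
    return names, [[v / top for v in row] for row in raw]


# Hypothetical per-kernel feature vectors (e.g. instruction-mix fractions).
vectors = {
    "k80.kernelA": (0.26, 0.50, 0.24),
    "m40.kernelA": (0.25, 0.52, 0.23),
    "k80.kernelB": (0.70, 0.10, 0.20),
}

for measure in (euclidean, manhattan, cosine_distance):
    names, matrix = distance_matrix(vectors, measure)
    print(measure.__name__)
    for name, row in zip(names, matrix):
        print("  ", name, [round(v, 3) for v in row])
```

The measures in this sketch happen to be symmetric, so both directions give the same value; evaluating every ordered pair matters for measures that are not symmetric.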
[Figure 9 image omitted. Title: "Similarity Measures for Kernels across Architectures"; panels: Euclidean, IsoRank, Cosine, Jaccard, Minkowski, Manhattan; both axes: Kernels (R.par.Z10su, R.pat.Z14dy, and S.sca.Z6red on k80, m40, and p100); color scale: 0.0 to 1.0.]
Fig. 9: Similarity measures for pairwise comparison of 3 kernels across 3
GPUs (9 total).
6.3 Discussion
These metrics can be used both for guiding manual optimizations and by com-
pilers or autotuners. For example, human optimization effort can focus on the
code fragments that are ranked high by kernel impact, but low by the goodness
metric. An autotuner can also use metrics such as the goodness metric to ex-
plore the space of optimization parameters more efficiently, such as by excluding
cases where we can predict a low value of the goodness metric without having to
execute and time the actual generated code. A benefit to end users (not included
in this paper due to space constraints) would be the ability to compare an
implementation against a highly optimized kernel. By making use of a subgraph
matching strategy as well as the instruction operations executed, CUDAflow is able
to provide a mechanism to characterize unseen kernels.
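The autotuner use case mentioned above, skipping variants whose predicted goodness is low, might be sketched as follows. Everything here is hypothetical: `prune_configurations`, the `threads_per_block` field, the threshold, and especially `toy_goodness`, which is a placeholder scoring function and not the goodness metric defined in this paper.

```python
def prune_configurations(configs, predict_goodness, threshold):
    """Split configurations into those worth timing and those skipped.

    `predict_goodness` is any callable that scores a configuration
    without executing the generated code.
    """
    to_run, skipped = [], []
    for cfg in configs:
        score = predict_goodness(cfg)
        (to_run if score >= threshold else skipped).append((cfg, score))
    return to_run, skipped


def toy_goodness(cfg):
    # Placeholder only (NOT the paper's metric): favors 256 threads per block.
    return 1.0 - abs(cfg["threads_per_block"] - 256) / 1024.0


# Hypothetical search space over thread-block sizes.
configs = [{"threads_per_block": t} for t in (32, 128, 256, 512, 1024)]
to_run, skipped = prune_configurations(configs, toy_goodness, threshold=0.8)

print("time these:", [cfg["threads_per_block"] for cfg, _ in to_run])
print("skip these:", [cfg["threads_per_block"] for cfg, _ in skipped])
```

Only the configurations in `to_run` would then be compiled, executed, and timed, which is where the savings in search time would come from.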
[Figure 10 image omitted. Title: "Dendrogram of Kernels with GPUs"; x-axis: Kernels (pairs of kernel/GPU labels drawn from R.par.Z10su, R.pat.Z14dy, and S.sca.Z6red on k80, m40, and p100); y-axis: distance (Ward).]
Fig. 10: Dendrogram of clusters for pairwise comparison for 3 kernels across 3
GPUs (27 total).
7 Conclusion
We have presented CUDAflow, a control-flow-based methodology for analyzing
the performance of CUDA applications. We combined static binary analysis with
dynamic profiling to produce a set of metrics that not only characterizes the ker-
nel by its computation requirements (memory or compute bound), but also pro-
vides detailed insights into application performance. Specifically, we provide an
intuitive visualization and metrics display, and correlate performance hotspots
with source line and file information, effectively guiding the end user to loca-
tions of interest and revealing potentially effective optimizations by identifying
similarities of new implementations to known, autotuned computations through
subgraph matching. We implemented this new methodology and demonstrated
its capabilities on SHOC and Rodinia applications.
Future work includes incorporating memory reuse distance statistics of a ker-
nel to characterize and help optimize the memory subsystem and compute/mem-
ory overlaps on the GPU. In addition, we want to generate robust models that
will discover optimal block and thread sizes for CUDA kernels for specific input
sizes without executing the application [18]. Last, we are in the process of de-
veloping an online web portal [7,26] that will archive a collection of control flow
graphs for all known GPU applications. For instance, the web portal would be
able to make on-the-fly comparisons across various hardware resources, as well as
other GPU kernels, without burdening the end user with hardware requirements
or software package installations, and would enable more feature-rich capabilities
when reporting performance metrics.
References
1. Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey,
J., Tallent, N.R.: HPCToolkit: Tools for performance analysis of optimized parallel
programs. Concurrency and Computation: Practice and Experience 22(6), 685–701
(2010)
2. Ammons, G., Ball, T., Larus, J.R.: Exploiting hardware performance counters with
flow and context sensitive profiling. ACM Sigplan Notices 32(5), 85–96 (1997)
3. Ball, T., Larus, J.R.: Optimally profiling and tracing programs. ACM Transactions
on Programming Languages and Systems (TOPLAS) 16(4), 1319–1360 (1994)
4. Böhm, C., Jacopini, G.: Flow diagrams, Turing machines and languages with only
two formation rules. Communications of the ACM 9(5), 366–371 (1966)
5. Borgelt, C., Berthold, M.R.: Mining molecular fragments: Finding relevant sub-
structures of molecules. In: Proceedings of the IEEE International Conference on
Data Mining. pp. 51–58. IEEE (2002)
6. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.H., Skadron, K.:
Rodinia: A benchmark suite for heterogeneous computing. In: Workload Charac-
terization, 2009. IISWC 2009. IEEE International Symposium on. pp. 44–54. IEEE
(2009)
7. Collective Knowledge (CK), http://cknowledge.org
8. Csardi, G., Nepusz, T.: The iGraph software package for complex network research
9. Danalis, A., Marin, G., McCurdy, C., Meredith, J.S., Roth, P.C., Spafford, K., Tip-
paraju, V., Vetter, J.S.: The scalable heterogeneous computing (SHOC) benchmark
suite. In: Proceedings of the 3rd Workshop on General-Purpose Computation on
Graphics Processing Units. pp. 63–74. ACM (2010)
10. Allinea DDT, http://www.allinea.com/products/ddt
11. Diamos, G., Ashbaugh, B., Maiyuran, S., Kerr, A., Wu, H., Yalamanchili, S.: SIMD
re-convergence at thread frontiers. In: Proceedings of the 44th Annual IEEE/ACM
International Symposium on Microarchitecture. pp. 477–488. ACM (2011)
12. Farooqui, N., Kerr, A., Eisenhauer, G., Schwan, K., Yalamanchili, S.: Lynx: A
dynamic instrumentation system for data-parallel applications on GPGPU archi-
tectures. In: International Symposium on Performance Analysis of Systems and
Software (ISPASS). pp. 58–67. IEEE (2012)
13. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley (1993)
14. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the pres-
ence of isomorphism. In: Data Mining, 2003. ICDM 2003. Third IEEE International
Conference on. pp. 549–552. IEEE (2003)
15. Koutra, D., Vogelstein, J.T., Faloutsos, C.: DeltaCon: A principled massive-graph
similarity function. SIAM
16. Lim, R., Carrillo-Cisneros, D., Alkowaileet, W., Scherson, I.: Computationally ef-
ficient multiplexing of events on hardware counters. In: Linux Symposium (2014)
17. Lim, R., Malony, A., Norris, B., Chaimov, N.: Identifying optimization opportuni-
ties within kernel execution in GPU codes. In: Euro-Par 2015: Parallel Processing
Workshops. Springer (2015)
18. Lim, R., Norris, B., Malony, A.: Autotuning GPU kernels via static and predictive
analysis. In: Parallel Processing (ICPP), 2017 46th International Conference on.
pp. 523–532. IEEE (2017)
19. Marin, G., Dongarra, J., Terpstra, D.: MIAMI: A framework for application per-
formance diagnosis. In: Performance Analysis of Systems and Software (ISPASS),
2014 IEEE International Symposium on. pp. 158–168. IEEE (2014)
20. Miller, B.P., Callaghan, M.D., Cargille, J.M., Hollingsworth, J.K., Irvin, R.B., Kar-
avanic, K.L., Kunchithapadam, K., Newhall, T.: The paradyn parallel performance
measurement tool. Computer 28(11), 37–46 (1995)
21. Nvidia Visual Profiler, https://developer.nvidia.com/nvidia-visual-profiler
22. Sabne, A., Sakdhnagool, P., Eigenmann, R.: Formalizing structured control flow
graphs. In: Languages and Compilers for Parallel Computing (LCPC). vol. 10136.
Lecture Notes in Computer Science (2016)
23. Sarkar, V.: Determining average program execution times and their variance. In:
ACM SIGPLAN Notices. vol. 24, pp. 298–312. ACM (1989)
24. Shende, S.S., Malony, A.D.: The TAU parallel performance system. International
Journal of High Performance Computing Applications 20(2), 287–311 (2006)
25. Singh, R., Xu, J., Berger, B.: Pairwise global alignment of protein interaction net-
works by matching neighborhood topology. In: Research in computational molec-
ular biology. pp. 16–31. Springer (2007)
26. Sreepathi, S., Grodowitz, M., Lim, R., Taffet, P., Roth, P.C., Meredith, J., Lee,
S., Li, D., Vetter, J.: Application characterization using Oxbow toolkit and PADS
infrastructure. In: Proceedings of the 1st International Workshop on Hardware-
Software Co-Design for High Performance Computing. pp. 55–63. IEEE Press
(2014)
27. Williams, M.H., Ossher, H.: Conversion of unstructured flow diagrams to struc-
tured form. The Computer Journal 21(2), 161–167 (1978)
28. Wu, H., Diamos, G., Li, S., Yalamanchili, S.: Characterization and transformation
of unstructured control flow in GPU applications. In: 1st International Workshop
on Characterizing Applications for Heterogeneous Exascale Systems (2011)
29. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Data Min-
ing, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on. pp.
721–724. IEEE (2002)
30. Zhang, F., D’Hollander, E.H.: Using hammock graphs to structure programs. IEEE
Transactions on Software Engineering 30(4), 231–245 (2004)
