A High-Throughput Solver for Marginalized Graph Kernels on GPU by Tang, Yu-Hang et al.
A High-Throughput Solver for Marginalized Graph
Kernels on GPU
Yu-Hang Tang, Oguz Selvitopi, Doru Thom Popovici, Aydın Buluc¸
Computational Research Division, Lawrence Berkeley National Laboratory
Email: {tang, roselvitopi, dtpopovici, abuluc}@lbl.gov
Abstract—We present the design and optimization of a linear
solver on General Purpose GPUs for the efficient and high-
throughput evaluation of the marginalized graph kernel between
pairs of labeled graphs. The solver implements a preconditioned
conjugate gradient (PCG) method to compute the solution to
a generalized Laplacian equation associated with the tensor
product of two graphs. To cope with the gap between the
instruction throughput and the memory bandwidth of current
generation GPUs, our solver forms the tensor product linear
system on-the-fly without storing it in memory when performing
matrix-vector dot product operations in PCG. Such on-the-fly
computation is accomplished by using threads in a warp to
cooperatively stream the adjacency and edge label matrices of
individual graphs by small square matrix blocks called tiles,
which are then staged in registers and the shared memory for
later reuse. Warps across a thread block can further share tiles
via the shared memory to increase data reuse. We exploit the
sparsity of the graphs hierarchically by storing only non-empty
tiles using a coordinate format and nonzero elements within each
tile using bitmaps. Besides, we propose a new partition-based
reordering algorithm for aggregating nonzero elements of the
graphs into fewer but denser tiles to improve the efficiency of
the sparse format.
We carry out extensive theoretical analyses on the graph tensor
product primitives for tiles of various density and evaluate their
performance on synthetic and real-world datasets. Our solver
delivers three to four orders of magnitude speedup over existing
CPU-based solvers such as GraKeL and GraphKernels. The
capability of the solver enables kernel-based learning tasks at
unprecedented scales.
I. INTRODUCTION
Recent advances in machine learning have sparked unique op-
portunities for building artificial intelligence on graphs, which
is a versatile data structure for representing non-sequential data
of discrete nature. As illustrated by Figure 1, a distinction of
graph-based discrete data from vector-based discretizable data
is that the former consists of indivisible elements that must be
inserted or withdrawn atomically. In contrast, the latter consist
of discretized samples drawn from a continuous signal at
tunable resolutions. Consequently, graph data does not trivially
permit interpolation, convolution, and inner product, which
are the operations commonly used in feature extraction. As
a result, special care must be taken to generalize machine
learning algorithms that operate on fixed-length feature vectors
and uniform grids to their graph-based counterparts.
One way to interface graph data to machine learning al-
gorithms is to apply the kernel trick. A graph kernel in
this context refers to a function that performs inner product
operations between graphs after implicitly transforming them
Fig. 1. Image and voice recordings are discretizable objects in the sense
that numeric representations for them can be acquired by sampling at a
certain resolution, which is a tunable parameter. In contrast, molecules and
social networks are non-sequential and discrete objects, and thus are better
represented by graphs.
D I S C R E T I Z A B L E
mammal
animal
cat
bear
whale
fish
water
vertebra
D I S C R E T E
SOUND PDE VOXELIMAGE TRAFFIC SOCIALMOLECULE SEMANTICS
into high- and even infinite-dimensional feature vectors. A
valid graph kernel must be positive definite, meaning that
the feature space must be a reproducing kernel Hilbert space.
The inner product thus naturally induces a measure of graph
similarity using the cosine of angles in the feature space.
Graph kernels allow a wide range of kernel-based learning
methods, e.g. support vector machine, Gaussian process re-
gression, spectral clustering, principal component analysis, to
operate straightforwardly on graph-based datasets.
The marginalized graph kernel [1] is a powerful tool for
graph similarity comparison between labeled and weighted
graphs of arbitrary size and topology. As illustrated in Fig-
ure 2, the kernel constructs a feature space containing infinitely
many dimensions, each of which represents a path on a graph.
The weight of a feature is set equal to the probability of
its path in a Markovian random walk process induced by
the graph’s adjacency matrix. The overall similarity is then
defined as the expectation of partial similarities between all
pairs of same-length paths, each of which is computed as
the product of a sequence of node-by-node and edge-by-edge
comparisons. Besides computing an overall similarity score
between two graphs, the kernel also defines a measure of node-
wise similarity, which is the expectation of the similarities
between all pairs of paths originating from a given pair of
nodes. The node-wise similarity is particularly useful for
learning tasks involving the transfer of node labels. The kernel
has found successful application in tasks such as prediction of
molecular energy [2] and protein function [3].
This paper focuses on the efficient and high-throughput
computation of the marginalized graph kernel, which is critical
for computing the pairwise similarity matrix between all pairs
of graphs, a task that occurs repeatedly when training many
ar
X
iv
:1
91
0.
06
31
0v
4 
 [c
s.D
C]
  2
5 F
eb
 20
20
Fig. 2. Unlike kernels which compute the inner product between fixed-length explicit feature vectors, the marginalized graph kernel computes the inner
product between labeled graphs. Such inner product is defined as the expectation of the inner products between all the simultaneous random walk paths on a
pair of graphs, and can be efficiently computed by solving a linear system associated with the tensor product of the two graphs.
E U C L I D E A N
D
I
S
T
A
N
C
E
A
N
G
L
E
G R A P H  K E R N E L  ( R K H S )
i n d i v i d u a l
g ra p h s
s i m u l t a n e o u s
ra n d o m  w a l k
p a t h
s i m i l a r i t y
P R O B A B I L I S T I C  I N T E R P R E T A T I O N L I N E A R  A L G E B R A  I N T E R P R E T A T I O N
d e g re e ve r t e x  l a b e l a d j a c e n c y e d g e  l a b e l
S P D  s y s t e m
to  s o l ve
kernel-based methods. As will be shown later in Section II-B
and Appendix A , each marginalized graph kernel evaluation
between a pair of graphs involves solving a linear system
whose size is the product of the number of nodes of the two
graphs. To obtain a pairwise similarity matrix for a dataset
of 2000 graphs, each with 100 nodes, we need to solve a
million 104 × 104 linear systems. Thus, a high-performance
and high-throughput solver is crucial for applying and scaling
the marginalized graph kernel to large datasets.
In this paper, we present a series of algorithms and opti-
mizations, such as on-the-fly Kronecker product matrix-vector
multiplication, partition-based graph reordering, and sparsity
exploitation, to accelerate the marginalized graph kernel com-
putation. The synergy of the algorithms leads to a solver that
achieves a significant performance boost over existing pack-
ages on general-purpose graphics processing units (GPGPUs).
The rest of the paper is organized as follows. In Section II,
we briefly review related mathematical background knowl-
edge, introduce the formulation of the marginalized graph
kernel, and carry out a preliminary analysis to identify design
challenges. In Section III, we explore several design options of
a dense Kronecker product matrix-vector multiplication primi-
tive and identify the optimal one through Roofline analyses and
microbenchmarking. In Section IV, we examine data structure
designs, graph reordering algorithms, and sparse Kronecker
product matrix-vector multiplication primitives in order to
exploit the sparsity in the graph. In Section V, we present
data sharing and load balancing approaches for scaling the
algorithm onto entire GPUs. Benchmark datasets and results
are given in section VI and section VII, respectively. We
discuss the connections between our project and previous ones
in section VIII and conclude the paper in section IX.
II. THEORETICAL BACKGROUND
A. Preliminaries and Notations
We use lower case letters in bold font, e.g. a, to denote vectors,
and upper case letters in bold font, e.g. A, to denote matrices.
By default, we assume vectors are column vectors. We use
diag(a) to denote a diagonal matrix whose diagonal elements
are specified by a. We use vertex and node interchangeably to
refer to the fundamental units of graphs.
Definition 1: Undirected graph
An undirected graph G is a discrete structure consisting of a
set of uniquely-indexed vertices V = {v1, v2, . . . , vn} and a
set of undirected edges E ⊂ V × V . The vertices and edges
may be labeled using elements from label sets Σv and Σe,
respectively.
Definition 2: Weighted graph
In a weighted graph, each edge (vi, vj) is associated with a
non-negative weight wij . In undirected graphs wij = wji.
wij = 0 if vi and vj are not connected by an edge. An
unweighted graph can be regarded as a specialized weighted
graph where wij = 1 between each pair of (vi, vj) connected
by an edge and 0 elsewhere.
Definition 3: Walk on graph
Two vertices are neighbors if they are connected by an edge.
A walk on a graph is a sequence of vertices and edges such
that all consecutive pairs of vertices are neighbors.
Definition 4: Adjacency matrix
The adjacency matrix of a graph of n vertices is a matrix A ∈
Rn×n with Aij = wij . The adjacency matrices of undirected
graphs are symmetric since wij ≡ wji.
Definition 5: Edge label matrix
The edge label matrix of a graph of n vertices is a matrix
E ∈ Σen×n with Eij = eij . E has the same symmetry and
sparsity pattern with A.
Definition 6: Kronecker product
Given matrices A ∈ Rn×m and B ∈ Rn′×m′ , the Kronecker
product P = A⊗B ∈ Rnn′×mm′ is defined as:
P = A⊗B :=

A1,1B A1,2B . . . A1,mB
A2,1B A2,2B . . . A2,mB
...
...
. . .
...
An,1B An,2B . . . An,mB

To better visualize the correspondence between an element
of the Kronecker product matrix and its source elements from
the operand matrices, we use a quadruple index notation
Pii′,jj′ , which is located at the (i × n + i′)-th row and
(j ×m + j′)-th column of P, to denote the element formed
by Aij ·Bi′j′ . Similarly, for a vector p = a⊗ b, we use pii′
to denote its (i × n + i′)-th component which is formed by
ai · bi′ .
Definition 7: Generalized Kronecker product
Given a set S whose elements are necessarily numeric, a
generalized Kronecker product P = A
κ
⊗ B between two
matrices A ∈ Sn×m and B ∈ Sn′×m′ with respect to a kernel
κ : S × S → R+, is a real matrix P ∈ Rnn′×mm′ where
Pii′,jj′ = κ(Aij ,Bi′j′). In other words, κ is a generalization
of the real number multiplication operation as used in the
standard Kronecker product on S.
Definition 8: Hadamard (element-wise) product
The element-wise product, also known as the Hadamard
product, between two matrices of the same size A,B ∈ Rm×n
is another matrix AB ∈ Rm×n with (AB)ij := Aij Bij .
B. Marginalized Graph Kernel
We have previously shown [2] that the computation to apply
marginalized graph kernel between two labeled graphs G and
G′ can be simplified into solving a linear system involving a
generalized Laplacian of the tensor product graph G⊗G′:
KMG(G,G
′) = pT×
(
D×V−1× −A× E×
)−1
D×q×. (1)
Here
p× := p ⊗ p′ is the starting probability of a Markovian
random walk process from each node of the product graph;
q× := q⊗q′ is the stopping probability of the random walk
process on each node of the product graph;
A× := A⊗A′ is the adjacency matrix of the product graph;
D× := diag(d ⊗ d′) is the degree matrix of the product
graph, while di =
∑
jAij + qi is the degree of node i;
V× := diag
(
v
κ
⊗ v′
)
is a diagonal matrix set by the
generalized Kronecker product with respect to a vertex base
kernel κv : Σv × Σv → R+, while v and v′ contains the
vertex labels of G and G′, respectively;
E× := E
κ
⊗E′ is the generalized Kronecker product between
edge label matrices E and E′ with respect to an edge base
kernel κe : Σe × Σe → R+.
Our extended preprint [4] gives a detailed derivation of Equa-
tion (1). Our earlier work [2] contains an example of the rules
for determining the specific values for p, q, V, A, and E.
D× and V× are complete diagonal matrices, and A× and
E× only have non-zero off-diagonal elements. The linear
system in Equation (1) is symmetric and positive definite as
long as the base kernels κv(·, ·) and κe(·, ·) themselves are
positive definite with ranges within (0, 1] and [0, 1], respec-
tively. The actual arithmetics involved to compute κv and κe
strongly affect the design of efficient matrix-vector multiplica-
tion primitives because they determine the computational costs
to generate E× and V× as detailed in Section III.
A degenerate case worth noting is when both the graph
nodes and edges are unlabeled. Such unlabeled graphs elimi-
nate the use of base kernels as well as the V× and E× matrices
from Equation (1). Consequently, Equation (1) gets simplified
into
KRW(G,G
′) = pT× (D× −A×)−1D×q×. (2)
Equation (2) is essentially the random walk graph kernel
proposed by Vishwanathan et al. [5]. We denote it as the
unlabeled graph kernel, and use it as one of the two model
problems in performance modeling and algorithm design.
C. Preconditioned Conjugate Gradient Method
A variety of methods such as conjugate gradient (CG), spectral
decomposition, fixed-point iteration, and generalized Sylvester
equation can be used to solve the linear system in Equation (1)
[5], [6]. Among those, spectral decomposition delivers the best
performance if the edges are unlabeled or labeled with a small
set of distinct elements. CG is favorable in many real-world
applications where the edges are labeled using larger and more
complex attribute sets. For example, the edges can be labeled
by interatomic distances that span some continuous interval of
R+ when the graphs represent 3D structures of molecules [2].
In this case, the spectral decomposition method is no longer
advantageous due to the need for looping over all pairs of
distinct labels.
Algorithm 1 illustrates the application of CG, together with
a diagonal preconditioner, for solving Equation (1). Being
formulated as an exact solver for symmetric and positive
definite linear systems that iteratively minimizes a residual
vector in successive orthogonal directions, the method has in
practice often being used as an iterative solver because the
convergence can be achieved quickly due to the orthogonal-
ization of successive search directions.
D. Preliminary Roofline Analysis
Fig. 3. A Roofline analysis shows that the Kronecker product matrix-vector
multiplication operation in the conjugate gradient solver for the marginalized
graph kernel is memory-bound if implemented naı¨vely on the Volta V100
GPU. A possible solution is to regenerate the product matrix on-the-fly without
storing it. Hardware metrics provided by [7]. Vertical lines correspond to the
on-the-fly solver that uses each element r = 4, 16, 64 times when computing
the matrix-vector product for solving Equation (2) where E = 0, F = 4, X =
3.
Pea
k Sh
ared
Ban
dwi
dth
1/8 1/4 1/2 1 2 4 8 16 32 64 128 256
Arithmetic Intensity (SP FLOPS/byte)
1
2
4
8
16
32
64
128
256
512
Pe
rfo
rm
an
ce
pe
rS
M
(G
FL
OP
S)
GV100
Pea
k Gl
oba
l Ba
ndw
idth
Na
ïve
L × On
-th
e-
Fl
y
c
=
4
On
-th
e-
Fl
y
c
=
16
On
-th
e-
Fl
y
c
=
64
Peak SP
No FMA
The Kronecker product matrix-vector multiplication opera-
tion (A ⊗ A′)  (E κ⊗ E′) · p, as highlighted on line 10 of
Algorithm 1, has the highest order of asymptotic complexity
of O(N2) for a N × N system, and is the hotspot of
the CG algorithm. Hence, we construct a Roofline model
Algorithm 1 Preconditioned conjugate gradient algorithm for
the marginalized graph kernel. Legend: : off-diagonal
symmetric matrix-vector multiplication, : diagonal matrix-
vector multiplication, T : vector dot product, + : (scaled)
vector addition.
1 function CG4GK(d,d′,v,v′,A,A′,E,E′, q,q′)
2 M← diag
[
( d⊗ d′) (v κ⊗ v′)−1
]
+
3 x ← 0 +
4 r ← (d⊗ d′) · (q⊗ q′)
5 z ← v κ⊗ v′ +
6 p← z +
7 ρ ← rTz T
8 repeat
9 a ← ( d⊗ d′) (v κ⊗ v′)−1 · p
10 +(A⊗A′) (E κ⊗E′) · p
11 α← ρ/(pTa) T
12 x ← x+ αp +
13 r ← r− αa +
14 z ←M−1r +
15 ρ′← rTz T
16 β ← ρ′/ρ
17 p← z+ βp +
18 ρ ← ρ′
19 until rTr < 
20 return x
[8] to estimate the potential profitability of accelerating this
operation with GPUs. We will first focus on fully connected
graphs and show later in Section IV how sparsity and locality
in the graph can be exploited to improve performance further.
Motivated by real-world applications that we encounter as
exemplified in the Appendix of our extended preprint [4], we
use an abstract model for the storage and arithmetic cost of
the computation. In this model, we assume that a floating-
point number occupies F bytes, an edge label occupies E
bytes, and a function evaluation of κe(·, ·) costs X floating-
point operations.
In a naı¨ve implementation, the product matrix L×
.
= (A⊗
A′) (E κ⊗ E′) is precomputed beforehand and reused in the
CG loop. Given a pair of graphs each with n and m nodes,
respectively, the naı¨ve solver needs to load a floating-point
matrix L× of nm×nm elements and a right-hand side vector
p of nm elements, and perform n2m2 floating point fused
multiply-additions. Hence, the arithmetic intensity of the naı¨ve
solver is 2n2m2/(n2m2F + nmF ) → 2F , or 12 in single
precision mode. On the Volta GPU architecture, the solver
is severely memory-bound, achieving at most 3% utilization
of the peak floating-point performance as predicted by the
Roofline plot in Figure 3.
A further disadvantage of the naı¨ve approach is that the
product matrix takes up a prohibitively large amount of storage
space. Such behavior could limit both the size of the graphs as
well as the concurrency of pairwise graph kernel computations
that a GPU can accommodate.
In Algorithm 2, we outline an on-the-fly Kronecker product
Algorithm 2 A high-level outline of the on-the-fly Kronecker
product matrix-vector multiplication (XMV) algorithm.
1 function XMV((A,A′), (E,E′), a, p)
2 streaming length-c chunks in A,E OUTER LOOP
3 streaming length-c chunks in A′,E′ INNER LOOP
4 for each eij in first chunk do
5 for each e′i′j′ in second chunk do
6 aii′ ← aii′ + κ(eij , e′i′j′) · pjj′
matrix-vector multiplication (XMV) algorithm that directly
computes the inner product L× · p, instead of L×, in an
attempt to trade data movement with arithmetic operations.
The algorithm takes advantage of the Kronecker product
structure of L× to repeatedly recreate the matrix without
storing it. It is achieved by streaming elements from the
pair of individual graphs and caching them to perform the
computation. We can perform c2 edge kernel evaluations using
only fast memory and registers for every length-c chunk of
edge weight/label pairs streamed from the graphics mem-
ory. With a double loop structure, which amortizes the cost
of loading one of the two graphs, the on-the-fly approach
can achieve an arithmetic intensity of c
2X
c(E+F ) =
cX
E+F , or
specifically 34c in the unlabeled case. As shown in Figure 3,
tuning c can thus be used as a straightforward approach to
achieve the highest utilization of the computing power on the
Volta GPU. Note that regenerating the product matrix only
increases the constant factor, but does not alter the order of the
computational complexity of the matrix-vector multiplication
operation. Therefore, this approach is profitable as long as the
gain in instruction throughput outweighs the added cost of
base kernel evaluations.
III. ON-THE-FLY KRONECKER PRODUCT FORMATION AND
MATRIX-VECTOR MULTIPLICATION
For dense adjacency and edge label matrices of fully connected
graphs, we propose three concrete implementations of the on-
the-fly XMV algorithm as outlined in Algorithm 2 on the Volta
GPU. All three primitives adopt a warp-synchronous high-
throughput programming model, where every 32 consecutive
threads within a warp work cooperatively on a pair of graphs.
A naı¨ve implementation that uses a precomputed product
matrix L× is described in our extended preprint [4]. We
introduce methods to exploit the sparsity of the graphs in
Section IV, and methods for sharing data and work within
a thread block in Section V.
We denote the off-chip DDR or HBM memory attached
to a GPU as the device memory and the on-chip addressable
SRAM as the shared memory. A global load or store operation
accesses the device memory, while a shared load or store
operation accesses the shared memory.
A. Shared Tiling
Our first implementation, the shared tiling primitive, uses the
shared memory as a staging area to reduce the usage of global
load instructions and device memory bandwidth. Figure 3
shows that it is much easier to attain performance close to
Fig. 4. In a tile-based on-the-fly Kronecker matrix-vector multiplication
primitive, the product matrix is regenerated and multiplied with the right-
hand side vector by streaming and caching graphs in small pieces called tiles
in a double loop.
I ǞǞȨǚ
L ǣ ǖ Ȩ
I ǞǞȨǚ
NȨǔǘ
L ǣ ǖ Ȩ
OǗǘȨǚ
I ǘ Ȩǚȳǘ ǣ ǝǞ
NȨǔǘ
OǗǘȨǚ
I ǘ Ȩǚȳǘ ǣ ǝǞ
G: OU T E R LO O P G′ : I N N E R LO O P
E
D
I
S
D
N
A
H
T
F
E
L
E
D
I
S
D
N
A
H
T
H
G
I
R
the theoretical peak on the Volta GPU by loading data from
the shared memory, which can provide more than 104 GB/s
bandwidth.
As shown in Figure 4, the shared tiling primitive streams
t×r tiles and the corresponding right-hand side elements from
the device memory to the shared memory for computation.
The tiles and the right-hand side elements are all loaded
cooperatively by a warp to ensure coalesced access. Among
the threads in a warp, the computation for a pair of tiles is
parallelized along the rows of the product matrix in a round-
robin manner. The work is serialized among the columns
within each thread. In the actual code, we choose t = 8 and
explicitly unrolled the row loops by a factor of two to obtain
an instruction-level parallelism of two on warps of 32 threads.
B. Register Blocking
Our second implementation, the register blocking primitive,
uses the register file to stage and reuse matrix elements. While
the work is parallelized along the rows in the same way as the
shared tiling primitive, each thread here will independently
stream length-r chunks from the rows that it owns to compute
r2 elements of the product matrix. Due to the synchronous
execution behavior within CUDA thread warps, threads in a
warp can still share right-hand side elements via the shared
memory because the march across the columns is lock-stepped
between consecutive t×r blocks. The primitive is simpler than
the shared tiling primitive, but causes a higher register pressure
and may generate unnecessary global memory transactions
depending on the value of r.
C. Combining Shared Tiling and Register Blocking
Our third implementation combines shared tiling and regis-
ter blocking. This tiling-blocking primitive aims to reduce
shared and global memory transactions while simultaneously
reducing register pressure. Here, a t× t tile is first cached in
the shared memory, and then further staged in registers over
r-element chunks. It can be implemented easily by placing
compiler directives to unroll the inner loops over column
indices.
D. Performance Analysis
From Figure 5, we can see that the tiling-blocking primitive
performs the best in terms of time-to-solution. It also achieves
the best FLOPS efficiency, defined as the ratio between
the actual throughput of floating point operations and the
theoretical peak after adjusting for FMA percentage. Hence,
the tiling-blocking primitive with t = 8 and r = 8 is
chosen as the building block for subsequent kernels with more
optimizations. We denote the 8 × 8 square tiles as octiles
hereafter. The shared tiling primitive and the register blocking
primitive performed nearly equally well, yet was not able to
achieve the best performance. The shared tiling primitive is
unsurprisingly bound by the shared memory throughput as
indicated by the measured shared memory bandwidth utiliza-
tion and the Roofline model. The register blocking primitive
is bound by global memory throughput when r is small, yet
suffers from register spilling right before it reaches the top
of the Roofline model with r = 24. Additional tests on a
Titan X Pascal graphics card indicate that the shared tiling
primitive performs better than the register blocking primitive
on accelerator equipped with GDDR memories, but the tiling-
blocking primitive still provides the best performance with
most balanced utilization of hardware resources.
IV. EXPLICIT SPARSITY EXPLOITATION
Many graphs encountered in real-world applications harbor a
certain degree of sparsity, which can be exploited to optimize
performance. For example, a SMILES string represents a
molecular graph where edges connect only atoms that are
chemically bonded. In this case, the maximum number of
edges on each node is capped by the maximum number of
bonds that an atom can form, which rarely exceeds 8. A road
network graph is also sparse with 3-way and 4-way junctions
dominating the map. Even for 3D molecular structures where
edges encode contact relationships between all pairs of atoms,
the graphs can still be sparse due to the spatial locality of
non-bond interactions.
We adopt a two-level methodology to exploit the sparsity
in the graphs. In the first level, we exploit the sparsity at the
octile granularity by reducing non-empty tiles through graph
reordering. In the second level, we exploit the sparsity within
individual octiles by using a compact storage scheme that
stores only non-zero elements of the tiles, and by designing
corresponding sparse XMV primitives. In the rest of this
section, we use graph and matrix terms interchangeably.
A. Inter-Tile Sparsity
The sparsity of graphs can be readily exploited within the on-
the-fly XMV framework by pruning empty tiles that contain
no edges. The implementation of this pruning process as a
pre-processing pass is trivial, but its efficiency depends on
our ability to find empty tiles in the matrix. Hence, we resort
to reordering algorithms to group the nonzeros of the matrix
Fig. 5. Detailed benchmark and Roofline analysis of the three on-the-fly XMV primitives. Each primitive other than the naı¨ve one are instantiated with multiple
sets of parameters as given underneath each bar. For the shared tiling primitive, the two parameters specify the height and width of the tiles that are streamed
by the primitive; for the register blocking primitive, the two parameters have similar meanings to their shared tiling counterparts; for the tiling-blocking
primitive, the two parameters corresponds to the size of the square tiles streamed and the length of the chunks staged in registers, respectively.
1
4
16
64
256
Pea
k Gl
oba
l
Pea
k Sh
ared
1/8 1/4 1/2 1 2 4 8 16 32 64
Arithmetic Intensity (SP FLOPS/byte)
1
4
16
64
256
Pe
rfo
rm
an
ce
pe
rS
M
(G
FL
OP
S)
Device Memory
Throughput
Shared Memory
Throughput Per SM
FLOPS
Eƾciency
Walltime
30
2
m
s
(8
,2
)
16
6
m
s
(8
,4
)
67
m
s
(8
,8
)
42
m
s
(8
,1
2)
40
m
s
(8
,2
4)
54
m
s
(8
,4
)
43
m
s
(8
,8
)
39
m
s
(8
,1
6)
50
m
s
(8
,2
)
26
m
s
(8
,4
)
21
m
s
(8
,8
)
19
m
s
2%
(8
,2
)
10
%
(8
,4
)
25
%
(8
,8
)
41
%
(8
,1
2)
43
%
(8
,2
4)
32
%
(8
,4
)
40
%
(8
,8
)
44
%
(8
,1
6)
42
%
(8
,2
)
65
%
(8
,4
)
80
%
(8
,8
)
91
%
5
Gi
B/
s
(8
,2
)
22
Gi
B/
s
(8
,4
)
46
Gi
B/
s
(8
,8
)
72
Gi
B/
s
(8
,1
2)
76
Gi
B/
s
(8
,2
4)
56
Gi
B/
s
(8
,4
)
15
7
Gi
B/
s
(8
,8
)
16
2
Gi
B/
s
(8
,1
6)
15
2
Gi
B/
s
(8
,2
)
12
9
Gi
B/
s
(8
,4
)
12
5
Gi
B/
s
(8
,8
)
73
Gi
B/
s
75
9
Gi
B/
s
(8
,2
)
41
8
Gi
B/
s
(8
,4
)
59
2
Gi
B/
s
(8
,8
)
63
4
Gi
B/
s
(8
,1
2)
61
8
Gi
B/
s
(8
,2
4)
44
3
Gi
B/
s
(8
,4
)
34
5
Gi
B/
s
(8
,8
)
21
1
Gi
B/
s
(8
,1
6)
15
8
Gi
B/
s
(8
,2
)
33
3
Gi
B/
s
(8
,4
)
41
3
Gi
B/
s
(8
,8
)
43
8
Gi
B/
s
Metrics on 5120 pairs of graphs each with 72 nodes
Naïve Shared Tiling Register Blocking Tiling + Blocking
TABLE I
OPERATION COUNT, LOAD/STORE COUNT, AND ASYMPTOTIC ARITHMETIC INTENSITY OF THE ON-THE-FLY KRONECKER PRODUCT MATRIX-VECTOR
MULTIPLICATION (XMV) OPERATION FROM LINE 10 OF ALGORITHM 1 DURING ONE CG ITERATION. FOR A DETAIL DERIVATION OF THE EQUATIONS,
SEE APPENDIX C . n AND m: NUMBERS OF NODES OF THE TWO GRAPHS, RESPECTIVELY; E: BYTE SIZE OF AN EDGE LABEL; F : BYTE SIZE OF AN EDGE
WEIGHT; X : BASE KERNEL OPERATION COUNT.
Naive t× r shared tiling t× r register blocking length-r register blocking
within t× t shared tiling
Ops. 2n2m2 n2m2X n2m2X n2m2X
Global Load n2m2F n2m2( trE +
r+t
r F )/t
2 n2m2( trE +
t+r
r F )/t
2 n2m2(E + 2F )/t2
Global Store nmF nmF nmF nmF
Shared Load - n2m2( r+1r E +
2r+1
r F ) n
2m2F n2m2( r+trt E +
r+t
rt F )
Shared Store - n2m2( trE +
r+t
r F )/t
2 n2m2F/t2 n2m2(E + F )/t2
A.I. Global
2
F
t2X
t/rE + (1 + t/r)F
t2X
t/rE + (1 + t/r)F
t2X
E + 2F
A.I. Shared - X
(1 + 1/r)E + (2 + 1/r)F
X
(1 + 1/t2)F
X
(1/r + 1/t)E + (1/r + 1/t)F
into as few tiles as possible. Among a plethora of heuristics
in the literature for reordering matrices, the ones that we have
experimented with are:
• a custom partitioning-based reordering (PBR) algo-
rithm [9] that targets explicitly the objective of minimiz-
ing the number of non-empty tiles;
• the Reverse Cuthill-McKee (RCM) algorithm [10], which
is a heuristic that has found widespread use for fill-in and
matrix bandwidth reduction;
• a scheme based on solving the Traveling Salesman Prob-
lem (TSP) [11] with heuristics.
• a scheme using space-filling curves such as the Morton
curve [12] or the Hilbert curve when the vertices are
known to come from an embedding in a Euclidean space.
Among the four reordering methods, we have found that the
PBR-based method delivers the most reduction in non-empty
octiles using a moderate amount of time.
The Morton-based method delivers less reduction than RCM
and PBR despite being marginally faster. The TSP-based
method achieves a reduction rate between RCM and PBR.
However, the running time of the TSP-based reordering algo-
rithm is substantially longer than all other reordering methods
by orders of magnitude. Hence, we decide to focus only on
RCM and PBR in subsequent discussions, and present two
examples of molecular graphs representing the protein 2ONW
and 1AY3 from the Protein Data Bank (PDB) in their natural
orders, the RCM order, and the PBR order, respectively, in
Fig. 6.
In the particular case where the graphs represent 3D protein
structures, the nodes in their natural order, e.g. the order of the
corresponding amino acid residues in the primary structure of
the protein, already yields nearly optimal sparsity pattern in the
adjacency matrix. However, the PBR order can still beat the
natural order in reducing non-empty tiles for different datasets,
as evident in Figures 6 and 7. Moreover, reordering is useful
in the general case because the natural orderings of the nodes
are not always available.
Partitioning-based Reordering (PBR) for improved tile
density
The goal of PBR in our case is to reorder nodes in a graph
G = (V,E), i.e., to come up with a permutation of the rows
and columns of the corresponding matrix, in order to minimize
Fig. 6. An example where the partition-based reordering (PBR) method
outperforms the amino acid sequence order and the RCM order, yielding
graphs with fewer and more densely occupied tiles on two molecular graphs
from the PDB dataset.
19 tiles populated 19 tiles populated 13 tiles populated
44 tiles populated 40 tiles populated 32 tiles populated
2
O
N
W
1
A
Y
3
N AT U R A L R CM PB R
the number of non-empty t× t square tiles.
Let Π(G) = {V1, V2, . . . , VK} be a perfectly balanced K-
way vertex partition of G with K = d|V |/te, where all
parts in Π(G) with the possible exception of the last part has
exactly the same number of vertices. Π(G) then implies a
vertex ordering, where the vertices in Vk are ordered before
the vertices in Vk+1, for 1 ≤ k < K. Observe that for any
1 ≤ k 6= ` ≤ K, if there is at least one edge between the
nodes within Vk and V`, then the tile at the intersection of
kth row stripe and `th column stripe of the matrix, as well
as its symmetric counterpart, are non-empty. If there are no
edges between Vk and V`, then the respective tiles are empty.
Therefore, we can define the objective of PBR as finding the
Π(G) that minimizes
|{(Vk, V`) : k 6= ` and (vi ∈ Vk, vj ∈ V`) ∈ E}|. (3)
To seek a good Π(G), we utilize an approach [9] that is fast
and has a consistent objective with (3) but does not always
guarantee a perfectly balanced partition. Perfectly balanced
graph partitioning problem has previously been studied in
the literature [13], [14], usually with a different objective
of minimizing the number of inter-partition edges. These ap-
proaches also emphasize partition quality over speed and rely
on expensive algorithms such as tabu search. The approach
that we use here derives from a recursive bipartitioning scheme
that initially aims at reducing the messages sent in a parallel
application, which are modeled as off-diagonal blocks in a
matrix. The bipartitioning heuristics are much faster than the
algorithms used for perfectly balanced partitioning.
Even though the PBR algorithm [9] does not guarantee that
the partitions be perfectly balanced, an imbalance is rare as
long as all vertices have the same weight, which is precisely
the case in our work. Nonetheless, for the cases in which
the partitioner could not obtain a perfectly balanced partition,
we append an extra refinement step to move vertices from
the overloaded part to the underloaded part based on the
Fiduccia-Mattheyses (FM) algorithm [15]. We also utilized a
custom weight distribution, as opposed to a single imbalance
parameter, in the recursive bipartitioning process to promote
equally sized parts. Moreover, we adjust the parameters of the
partitioner to ensure a tight balancing constraint by setting
the refinement algorithm to boundary FM with tight balance.
Finally, we set the cost of the message nets, a parameter that
emphasizes the importance of the reduction of non-empty tiles,
to a large value such as 50.
Fig. 7. The PBR method consistently outperforms other graph reordering
algorithms and improves the sparsity pattern of graphs from four datasets of
real and synthetic graphs.
0% avg. density 18% 100%
0% avg. density 18% 100%
0% avg. density 25% 100%
NAT
URA
L
36%
RCM 37%
PBR 27%
Protein crystal structure
0% avg. density 8% 100%
0% avg. density 9% 100%
0% avg. density 9% 100%
NAT
URA
L
50%
RCM 43%
PBR 43%
DrugBank
0% avg. density 13% 100%
0% avg. density 11% 100%
0% avg. density 16% 100%
NAT
URA
L
51%
RCM 57%
PBR 41%
Newman-Watts-Strogatz
0% avg. density 12% 100%
0% avg. density 12% 100%
0% avg. density 16% 100%
NAT
URA
L
97%
RCM 93%
PBR 74%
Barabási-Albert
Precentage of
non-empty tiles
Density distribution within
non-empty tiles
In Fig. 7, we illustrate the performance of the PBR order
as compared to the natural order and the RCM order on four
different datasets as detailed in Section VI. PBR achieves the
best reduction over the natural ordering in all datasets, while
RCM can only improve the non-empty octile count in two of
the datasets.
Reordering overhead Reordering is justified when its cost
is smaller than the computational savings it enables. The
PBR reordering incurs a linear-time pre-processing overhead
proportional to the number of non-zeros in the matrices, while
Fig. 8. The comparative advantages of the different dense/sparse primitives
depend on the sparsity patterns of the tiles. Dynamic selection of the primitives
can thus lead to performance improvements.
8 16 24 32 40 48 56 64
non-zeros in 1st tile
8
16
24
32
40
48
56
64
no
n-
ze
ro
s
in
2n
d
til
e
unlabeled graph
8 16 24 32 40 48 56 64
non-zeros in 1st tile
labeled graph
ProƼtable Regions of Each Primitive
sparse×sparse dense×sparse dense×dense
the marginalized graph kernel incurs a quadratic cost in the
number of non-zeros during each CG iteration. Moreover, the
graph kernel often has to be evaluated on all pairs of graphs
for hundreds of times to train a machine learning model, while
the training data only need to be reordered once. Hence, the
overhead of the PBR reordering can be quickly amortized and
leads to shorter overall time-to-solution.
B. Intra-Tile Sparsity
As already demonstrated in Section III, the tiling-blocking
kernel is very efficient on dense tiles. Moreover, the kernel
is still efficient on most sparse graphs because our reordering
algorithms tend to create locally dense areas in the matrices.
Nonetheless, as seen from Figure 7, although the reordering
methods indeed increase the octile density compared to the
natural order, the non-empty tiles can still be up to 90%
empty. Hence, we can attain considerable savings by storing
and processing only the nonzero elements instead of treating
the tiles as dense.
In order to exploit sparsity within an octile, we use a
compact layout to store only nonzero elements. An accom-
panying 64-bit integer, whose ith element is set if the ith
element is nonzero, is used to locate the nonzero elements
in the original octile. We then rely on bit manipulations to
find the indices of the nonzero elements. Compared to the
dense octile representation, the sparse representation reduces
unnecessary global memory transactions besides wasting flops.
However, this comes at the expense of increased shared
memory utilization.
Hybrid dense-sparse computation The optimal way of eval-
uating the XMV operation given a pair of tiles depends on
the sparsity of the tiles. Utilizing a single primitive for the
entire execution may hurt the performance as the kernels’
performance largely depend on the octile density, which can
vary significantly within and across datasets as visualized in
Figure 7. As such, we designed two new types of XMV prim-
itives in addition to the dense× dense kernel: (i) a primitive
for the tensor product between a dense tile and a sparse tile, or
vice versa (dense× sparse), and (ii) a primitive for the tensor
product between two sparse tiles (sparse× sparse).
Fig. 8 illustrates the best performing product kernel for
a varying number of nonzeros of the two source octiles for
both labeled and unlabeled graphs. The sparse×sparse kernel
performs the best when each of the octiles contains up to 8-10
nonzeros for the unlabeled graphs and up to 16 nonzeros for
the labeled graphs. The dense×dense kernel runs the fastest
from that point on as both of the octiles get denser. In the rest,
the dense×sparse kernel performs better.
In our production kernel, we dynamically select either the
sparse×sparse or the dense×dense kernel before carrying
out the tensor product operations depending on the type of
the graph and the number of products the two octiles require.
The octiles are always stored in a compact form and expanded
in the shared memory after loading them from global memory.
V. TILE SHARING AND LOAD BALANCING
A. Block-Level Sharing
To fully utilize the GPU, which can simultaneously execute
thousands of warps on the fly, we perform the graph kernel
computations between many different pairs of graphs simulta-
neously within a single kernel launch.
One option is to assign each thread warp a unique graph
pair, while the program assumes a SIMD model within each
warp. No explicit synchronization or cooperation between
thread warps is needed. It is unfavorable when low-latency
computations for a few graphs are required because the work
on each pair of graphs can only be parallelized among a fixed
small number of threads. Consequently, thousands of graph
pairs are needed to provide enough concurrency to saturate
the thousands of CUDA cores on a Volta GPU.
A second option is to further parallelize the computation
within a thread block, whose size can vary between 32 to
1024 threads on CUDA GPUs. A first and obvious benefit of
this approach is that it provides us the ability to use block
size to adjust the latency for computing each pair of graphs,
as well as allowing a smaller number of graph pairs to saturate
the entire GPU. This block-based cooperative approach also
has the potential to further improve performance by allowing
warps within a block to share the octiles in shared memory. As
revealed in the Roofline analysis, larger tiles results in more
data reuse, less redundant load/store operations, and higher
arithmetic intensity. However, there is a limit on the size of
tiles that a warp can hold without constraining occupancy,
i.e. the number of warps on the fly. To work around this,
we let all the N warps in a CUDA thread block each load
an octile, and then share the octiles to compute N2 tile-level
XMV operations.
Tile sharing requires block-level synchronization before
and after octile loading. Besides, atomic accumulations are
necessary for writing to the output vector since the COO
storage format obscures the effort to schedule workload among
the warps in ways such that the output could be conflict-free.
However, the performance impact on CUDA GPUs should be
very minimal because atomic accumulations whose outputs are
not immediately used are carried out by nonblocking atomic
reduction instructions. As such, the threads that commit the
atomic accumulations will not get stalled.
B. Inter-Block Load Balancing
Thanks to the independence of the computations between
different pairs of graphs, load balancing is relatively straight-
forward since tasks can freely relocate across thread blocks
and stream processors. Aside from transient factors such as
warp scheduling, cache conflict, and atomics, the primary
source of load imbalance is the variation of graph size and
sparsity pattern that affect the problem size as well as the
number of conjugate gradient iterations for convergence.
VI. BENCHMARK DATASET
A. Synthetic Graphs
To test the performance of our solver, we use the Newman-
Watts-Strogatz (NWS) algorithm and the Baraba´si-Albert (BA)
algorithm to generate synthetic graphs of small-world and
scale-free characteristics, respectively.
B. Real-World Dataset
The graph kernel is further tested on real-world datasets as
summarized below:
1) The PDB-3k dataset is a 1324-structures subset of the
Protein Data Bank database [16] containing proteins
less than 3000 in weight and contain no DNA/RNA
complexes. Each protein is converted into a graph with
nodes representing heavy atoms. A spatial adjacency
rule creates edges between spatially neighboring atoms
such that the weights reach maximum when two atoms
overlap, and smoothly decay to zero at a certain cutoff
distance. The edges are labeled with the interatomic
distance between its endpoints.
2) DrugBank [17] is a comprehensive database containing
information about drug molecules. The dataset contains
more than 104 drug molecules, 10607 of which has
a corresponding linearized representation as a SMILES
string, which is obtained from a depth-first traversal of
the corresponding molecular graph. A rich body of node
and edge attributes can be extracted from the SMILES
strings such as hybridization state, charge, bond order,
and conjugacy.
VII. PERFORMANCE MEASUREMENT AND ANALYSIS
Benchmarks are performed on the Summit supercom-
puter at Oak Ridge National Laboratory. The runtime
and performance metrics of our GPU kernels are mea-
sured using the nvprof program from the CUDA Toolkit,
while CPU-side time measurements are obtained using the
time.perf_counter_ns() method from the Python
standard library.
A. Performance Improvement of Proposed Optimization Tech-
niques
In this section, we characterize and compare the performance
gain enabled by the previously described optimization tech-
niques on both the synthetic and real-world graph datasets.
The measurements are carried out using the naı¨ve kernel as
a baseline and then enabling the optimization techniques one
at a time in the same order as they appear in the previous
sections. For the synthetic graph datasets, we generate 160
graphs containing 96 nodes for each type with the following
parameters:
· Newman-Watts-Strogatz: k = 3, p = 0.1;
· Baraba´si-Albert: m = 6.
We also test the kernel on all graphs in the PDB and DrugBank
datasets.
From Figure 9, we can conclude that the performance
improvements brought about by the techniques depend on the
characteristics of the actual dataset. Overall, the speedup is
more impressive on the real-world datasets, which contain
more diverse types of graphs. It turns out inter-tile sparsity
exploitation, when directly applied to the graphs in their
natural order, improves the performance on all datasets except
for the scale-free networks which contain poor locality. On
top of that, PBR-based reordering performs very well and
increases the performance of the solver on all datasets. The
adaptive dense/sparse primitive switch and the compact tile
storage format can further improve solver performance on all
datasets.
Block-level tile sharing leads to significant performance
improvement on DrugBank but only mild improvements on
other datasets. The reason is that only the DrugBank dataset
exhibits considerable size variation with graphs containing
1 to 551 nodes. In that case, block-level tile sharing can
significantly reduce the time to compute the largest pair of
molecules, which otherwise takes a very long time using only
a single warp. Dynamic scheduling brings about marginal
performance improvements because the GPUs are already
saturated by our datasets.
B. Performance Comparison with State-of-the-Art Packages
We further compare the performance of our solver against
two state-of-the-art packages for graph kernel computations:
GraKeL [18] and GraphKernels [19]. GraKeL is a Python
package compatible with scikit-learn [20]. The compute-
intensive part of GraKeL is implemented using Cython [21],
which compiles codes written in a Python-like syntax into
binaries on the target machine. The GraphKernels package
is implemented in C++ and has a Python frontend generated
with SWIG [22]. Both packages run only on CPUs, although
GraKeL does support parallelism using multiple processes but
with limited scaling efficiency. When executing the codes on
the Power9 cores of Summit, we allocate 4 physical cores to
GraKeL and 1 physical core to GraphKernels.
As shown by Figure 10, our solver significantly outperforms
both GraKeL and GraphKernels by 3-4 orders of magnitude
on real-world datasets. Besides performance, it is worth noting
that we had to carry out the computation using a relatively
large stopping probability for both GraKeL and GraphKernels
to avoid convergence failures. Coincidentally, a larger stopping
probability can reduce time to solution at the expense of the
discriminating power of the kernel. Our presented kernel does
Fig. 9. Time to solution for the presented solver equipped with different
optimization techniques (data with an asterisk are projected from ensembles of
32 random subsets of the entire dataset). Within each dataset, each bar repre-
sents a kernel that incorporates a new optimization technique while inheriting
everything else from the kernel below. Label interpretation: Dense – the naı¨ve
kernel, Sparse – sparsity exploitation at the inter-tile level, +Reorder –
enabling PBR graph reordering, +Adaptive – adaptive switching between
dense and sparse tile primitives, +Compact – compact storage format of each
tile, +Block: sharing of tiles within a block, +DynSched – dynamic work
scheduling.
Incremental Speedup of Proposed Optimization Techniques
Dense
Sparse
+Reorder
+Adaptive
+Block
+DynSched
Sm
al
lW
or
ld
8.4 s
2.6 s
1.7 s
0.92 s
0.80 s
0.78 s
Dense
Sparse
+Reorder
+Adaptive
+Compact
+Block
+DynSched
Sc
al
e-
fre
e
7.4 s
7.6 s
4.5 s
2.5 s
2.1 s
2.0 s
1.9 s
Dense
Sparse
+Reorder
+Adaptive
+Compact
+Block
+DynSched
Pr
ot
ei
n
4919 s
1491 s
884 s
772 s
725 s
158 s
157 s
Dense
Sparse
+Block
+Reorder
+Adaptive
+Compact
+DynSched
Dr
ug
Ba
nk
56152 s∗
12572 s∗
507 s
386 s
269 s
256 s
258 s
not have a convergence issue and can compute using stopping
probability values as small as 0.0005.
Fig. 10. The solver presented in this paper outperforms existing Python
packages by several orders of magnitude.
1 second 1 min 1 hour 1 day 1 month
Dru
gBa
nk
PDB
Time-to-solution comparison with GraKeL and GraphKernels
172 seconds
12.9 days
2.0 days
6461×
998×
153 seconds
5.8 days
22.0 days
3297×
12430×
Present GraKeL GraphKernels
VIII. RELATED WORK
The two packages GraKeL and Graph-Kernels that we have
compared against in section VII-B provide the closest func-
tionality to our solver, but we significantly outperform them by
several orders of magnitude. Moreover, only GraKeL supports
graphs with both labeled vertices and labeled edges, which
are crucial for building accurate machine learning models
for molecular systems [2]. It is easily verifiable that the
normalized Gramian matrix generated using unlabeled graphs
contains only numbers all very close to unity, implying that
all graphs are identical to each other under the unlabeled sim-
ilarity measure. As such, we believe the present solver, which
can efficiently compute the graph kernel for labeled graphs,
represents not only an improvement in terms of computational
speed but also an enhancement of the functionality available
to the end-users.
The graph kernel is fundamentally different from network
alignment algorithms [23] that can provide an estimate on the
similarity of two graphs. Two issues make network alignment
algorithms unsuitable as kernels between graphs. First, align-
ment algorithms generally are not positive definite functions,
i.e. they do not induce a norm on an associated Hilbert space.
Second, alignment algorithms are potentially more expensive
because it involves more work in addition to computing nodal
similarities.
Last, the work of Livi et al. [24] contains an algorithmic
motif concerning the parallel computation of graph tensor
products for inexact graph matching. While their formulation
also has a product weight matrix that is computed using a
vertex kernel and an edge kernel, the product graph is not
used to construct a linear system that has to be solved.
IX. CONCLUSION
In this paper, we presented a series of algorithms to accelerate
a marginalized graph kernel solver on GPU. The solver
is essentially an implementation of the conjugate gradient
method for a generalized Laplacian system induced by the
Kronecker product of a pair of graphs. Via roofline analysis,
we identified that the solver would likely be memory-bound
due to the matrix-vector inner product operation in the CG
method. We overcame this issue by taking advantage of the
Kronecker product structure of the system. In our approach, we
do not precompute the product system, but rather stream and
cache the original graph pair in tiles, and compute the product
system on-the-fly. This approach significantly reduces global
memory traffic, and only increases the asymptotic arithmetic
operation count by a constant factor, which can be easily offset
by the substantial gain in instruction throughput. Moreover,
the solver can take advantage of the sparsity in the graph by
making use of a two-level storage format to trim out zero
elements. We compared the performance of the solver with
existing packages and demonstrated that our implementation
delivered significant speedups.
This work also exemplifies the paradigm of applying linear
algebra concepts and techniques to solving graph problems.
The graph kernel problem constitutes a concrete example
of the need for standardized application programming in-
terfaces for graph tensor products in specifications such as
GraphBLAS [25], and prompt for the development of high-
performance and general implementations of the interface. Our
work suggests that the semantics for the inner product between
tensor product structures may see broader applicability than
that for the mere computation of the tensor product itself.
ACKNOWLEDGMENT
This work was supported by the Luis W. Alvarez Postdoctoral
Fellowship at Lawrence Berkeley National Laboratory. This
work is also supported in part by the Applied Mathematics
program of the DOE Office of Advanced Scientific Computing
Research under Contract No. DE-AC02-05CH11231, and in
part by the Exascale Computing Project (17-SC-20-SC), a
collaborative effort of the U.S. DOE Office of Science and
the NNSA. This research used resources of the Oak Ridge
Leadership Computing Facility at the Oak Ridge National
Laboratory, which is supported by the Office of Science of
the U.S. DOE under Contract No. DE-AC05-00OR22725.
YHT thanks David Williams-Young, and Caitlin A Whitter
for helpful discussions and suggestions.
REFERENCES
[1] H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized kernels between
labeled graphs,” in Proceedings of the 20th International Conference on
Machine Learning (ICML-03). AAAI Press, 2003, pp. 321–328, 00000.
[2] Y.-H. Tang and W. A. de Jong, “Prediction of atomization energy using
graph kernel and active learning,” The Journal of Chemical Physics, vol.
150, no. 4, p. 044107, Jan. 2019, autocitation-1.
[3] K. M. Borgwardt, C. S. Ong, S. Scho¨nauer, S. V. N. Vishwanathan,
A. J. Smola, and H.-P. Kriegel, “Protein function prediction via graph
kernels,” Bioinformatics, vol. 21, no. suppl 1, pp. i47–i56, Jun. 2005.
[4] Y.-H. Tang, O. Selvitopi, D. Popovici, and A. Buluc¸, “A High-
Throughput Solver for Marginalized Graph Kernels on GPU,”
arXiv:1910.06310 [cs], Dec. 2019.
[5] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M.
Borgwardt, “Graph kernels,” Journal of Machine Learning Research,
vol. 11, no. Apr, pp. 1201–1242, 2010, 00000.
[6] S. Vishwanathan, K. M. Borgwardt, and N. N. Schraudolph, “Fast
Computation of Graph Kernels,” in NIPS, vol. 19, 2006, pp. 131–138.
[7] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissect-
ing the NVIDIA Volta GPU Architecture via Microbenchmarking,”
arXiv:1804.06826 [cs], Apr. 2018.
[8] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful
visual performance model for multicore architectures,” Communications
of the ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[9] O. Selvitopi, S. Acer, and C. Aykanat, “A recursive hypergraph biparti-
tioning framework for reducing bandwidth and latency costs simultane-
ously,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 2, pp. 345–358,
Feb. 2017.
[10] A. George and J. W. H. Liu, Computer Solution of Large Sparse
Positive Definite Systems, ser. Prentice-Hall Series in Computational
Mathematics. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[11] A. Pinar and M. T. Heath, “Improving Performance of Sparse Matrix-
Vector Multiplication,” in SC ’99: Proceedings of the 1999 ACM/IEEE
Conference on Supercomputing, Nov. 1999, pp. 30–30.
[12] Y.-H. Tang and G. E. Karniadakis, “Accelerating dissipative particle
dynamics simulations on GPUs: Algorithms, numerics and applications,”
Computer Physics Communications, vol. 185, no. 11, pp. 2809–2822,
Nov. 2014, 00025.
[13] P. Sanders and C. Schulz, “Think Locally, Act Globally: Highly
Balanced Graph Partitioning,” in Experimental Algorithms, ser. Lec-
ture Notes in Computer Science, V. Bonifaci, C. Demetrescu, and
A. Marchetti-Spaccamela, Eds. Springer Berlin Heidelberg, 2013, pp.
164–175.
[14] U. Benlic and J.-K. Hao, “An effective multilevel tabu search approach
for balanced graph partitioning,” Comput. Oper. Res., vol. 38, no. 7, pp.
1066–1075, Jul. 2011.
[15] C. M. Fiduccia and R. M. Mattheyses, “A linear-time heuristic for
improving network partitions,” in Proceedings of the 19th Design
Automation Conference, ser. DAC ’82. Piscataway, NJ, USA: IEEE
Press, 1982, pp. 175–181.
[16] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat,
H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,”
Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, Jan. 2000.
[17] D. S. Wishart, Y. D. Feunang, A. C. Guo, E. J. Lo, A. Marcu, J. R. Grant,
T. Sajed, D. Johnson, C. Li, Z. Sayeeda, N. Assempour, I. Iynkkaran,
Y. Liu, A. Maciejewski, N. Gale, A. Wilson, L. Chin, R. Cummings,
D. Le, A. Pon, C. Knox, and M. Wilson, “DrugBank 5.0: A major update
to the DrugBank database for 2018,” Nucleic Acids Research, vol. 46,
no. D1, pp. D1074–D1082, Jan. 2018.
[18] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and
M. Vazirgianis, “GraKeL: A Graph Kernel Library in Python,” Jun. 2018.
[19] M. Sugiyama, M. E. Ghisu, F. Llinares-Lo´pez, and K. Borgwardt,
“Graphkernels: R and Python packages for graph comparison,” Bioin-
formatics, vol. 34, no. 3, pp. 530–532, Feb. 2018.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E´. Duch-
esnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, Oct. 2011.
[21] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and
K. Smith, “Cython: The Best of Both Worlds,” Computing in Science
and Engg., vol. 13, no. 2, pp. 31–39, Mar. 2011.
[22] D. M. Beazley, “Automated Scientific Software Scripting with SWIG,”
Future Gener. Comput. Syst., vol. 19, no. 5, pp. 599–609, Jul. 2003.
[23] R. Singh, J. Xu, and B. Berger, “Global alignment of multiple protein
interaction networks with application to functional orthology detection,”
Proceedings of the National Academy of Sciences, vol. 105, no. 35, pp.
12 763–12 768, Sep. 2008.
[24] L. Livi and A. Rizzi, “Parallel algorithms for tensor product-based
inexact graph matching,” in The 2012 International Joint Conference
on Neural Networks (IJCNN), Jun. 2012, pp. 1–8.
[25] A. Buluc¸, T. Mattson, S. McMillan, J. Moreira, and C. Yang, “Design of
the GraphBLAS API for C,” in 2017 IEEE International Parallel and
Distributed Processing Symposium Workshops (IPDPSW), May 2017,
pp. 643–652.
[26] Wendland, Scattered Data Approximation. Cambridge university press,
2004, vol. 17.
[27] Y.-H. Tang, D. Zhang, and G. E. Karniadakis, “An atomistic fingerprint
algorithm for learning ab initio molecular force fields,” The Journal of
Chemical Physics, vol. 148, no. 3, p. 034101, Jan. 2018, 00000.
APPENDIX
A. Derivation of the Linear Algebra Form of the Marginalized Graph Kernel
A formula for evaluating the marginalized graph kernel, as directly implied by the random walk picture, reads:
K(G,G′) =
∞∑
`=1
∑
h
∑
h′
[
ps(h1) p
′
s(h
′
1) κv(vh1 , v
′
h′1
) pq(h`) p
′
q(h
′
`)
(∏`
i=2
pt(hi|hi−1)
) ∏`
j=2
p′t(h
′
j |h′j−1)

(∏`
k=2
κv(vhk , v
′
h′k
)κe(ehk−1hk , e
′
h′k−1h
′
k
)
) ]
. (4)
However, an equivalent formulation [2], which transforms the task into solving a generalized Laplacian equation on the tensor
product graph, permits more efficient numerical computation. To obtain this linear algebra formulation, we restate Equation (4)
under the spirit of dynamic programming following [1]:
K(G,G′) =
∑
h1∈V,h′1∈V ′
ps(h1) p
′
s(h
′
1) κv(h1, h
′
1) R∞(h1, h
′
1), (5)
where R∞ is the solution to the linear system:
R∞(h1, h′1) = pq(h1) p
′
q(h
′
1) +
∑
i∈V,j∈V ′
t(i, j, h1, h
′
1) R∞(i, j), (6)
with
t(i, j, h1, h
′
1) := pt(i|h1) p′t(j|h′1) κv(vi, vj) κe(ei h1 , ej h′1). (7)
Equations (5) to (7) exhibit a Kronecker product structure, which can be readily recognized in matrix form:
K(G,G′) =
(
p⊗ p′
)T
· diag
(
v
κv⊗ v′
)
· r∞, (8)
with r∞ being the solution to the linear system
r∞ = q⊗ q′ +
[(
P⊗P′
)

(
E
κe⊗ E′
)]
· diag
(
v
κv⊗ v′
)
· r∞, (9)
where
v is the vertex label vector of G with vi = vi;
p is the starting probability vector of G with pi = ps(vi);
q is the stopping probability vector of G with qi = pq(vi);
P is the transition probability matrix of G defined as D−1A;
E is the edge label matrix of G with Eij = eij ;
v′, p′, q′, P′, E′ are the corresponding vectors and matrices for G′;
κv⊗ is the generalized Kronecker product between v and v′ with respect to κv;
κe⊗ is the generalized Kronecker product between E and E′ with respect to κe.
For clarity of discussion, we denote
V× := diag
(
v
κv⊗ v′
)
,
D× := diag(d)⊗ diag(d′),
A× := A⊗A′,
P× := P⊗P′ = D−1× A×,
E× := E
κe⊗ E′,
p× := p⊗ p′,
q× := q⊗ q′.
To solve eq. (9), first observe that only the product V×r∞ as a whole is needed to compute K(G,G′). We can thus rearrange
eq. (9) to form a symmetric linear system.
r∞ − (P× E×)V×r∞ = q×, (10)(
V−1× −P× E×
)
V×r∞ = q×, (11)
V×r∞ =
(
V−1× −P× E×
)−1
q× (12)
=
[
V−1× −
(
D−1× A×
)E×]−1 q× (13)
=
(
D×V−1× −A× E×
)−1
D×q×. (14)
The linear system D×V−1× −A× E× in eq. (14) is symmetric and positive-definite, as long as q > 0, κv(·, ·) <= 1, and
κe(·, ·) <= 1. Thus, we have reached the full expression for the marginalized graph kernel in matrix form, the solution of
which is the central focus of this paper:
K(G,G′) = pT×
(
D×V−1× −A× E×
)−1
D×q×. (15)
B. Example of edge label and kernel
Some examples of edge kernels used in practice are: 1) a square exponential kernel κSE(e1, e2)
.
= exp[−α(e1−e2)2] consumes
two floats and carries out 3 multiplication and 1 exponentiation; 2) a degree n compact polynomial radial basis kernel, e.g.
in the form κP(e1, e2)
.
=
∑
i αi(e1 − e2)i [26], [27] consumes two floats and performs n chained FMA instructions; 3) a
Kronecker product kernel κkron(e1, e2)
.
=
∏
i κi(e
i
1, e
i
2) consumes 2n inputs and carry out a linearly proportional number
of operations; 4) an R-convolutional kernel κR(e1, e2) =
∑
i
∑
j κ(e
i
1, e
j
2) consumes 2n inputs and carry out a quadratically
proportional number of arithmetics.
C. Pseudocode, I/O and Operation Counts of on-the-fly XMV primitives
Naı¨ve
Line Algorithm Category Loop count Unit cost Total cost
1 parfor i ∈ [0 . . . nm] do
2 ai ← 0
3 for J ∈ [0,WARPSIZE, . . . nm] do
4 GLOBALLOAD(pJ+LANE) LD.G nm · nm/32 F n2m2F/32
5 for j ∈ J + [0, 32) do
6 GLOBALLOAD(Lij) LD.G nm · nm F n2m2F
7 SHUFFLE(pj) from lane j
8 ai ← ai + Lij × pj OPS nm · nm 2 2n2m2
9 GLOBALSTORE(ai) ST.G nm F nmF
Shared Tiling
Line Algorithm Category Loop count Unit cost Total cost
1 for I ∈ [0, t, 2t . . . n), I ′ ∈ [0, t, 2t, . . .m) do
2 parfor i ∈ I + [0, t), i′ ∈ I ′ + [0, t) do
3 aii′ ← 0
4 for J ∈ [0, r, 2r . . . n) do
5 GLOBALLOAD(AI+[0,t),J+[0,r)) LD.G nt · mt · nr rtF n2mF/t
6 SHAREDSTORE(AI+[0,t),J+[0,r)) ST.S rtF n2mF/t
7 GLOBALLOAD(EI+[0,t),J+[0,r)) LD.G rtE n2mE/t
Continued on next page
TABLE III – Continued from previous page
Line Algorithm Category Loop count Unit cost Total cost
8 SHAREDSTORE(EI+[0,t),J+[0,r)) ST.S rtE n2mE/t
9 for J ′ ∈ [0, r, 2r . . .m) do
10 GLOBALLOAD(A′I′+[0,t),J′+[0,r)) LD.G
n
t · mt · nr · mr rtF n2m2F/rt
11 SHAREDSTORE(A′I′+[0,t),J′+[0,r)) ST.S rtF n
2m2F/rt
12 GLOBALLOAD(E′I′+[0,t),J′+[0,r)) LD.G rtE n
2m2E/rt
13 SHAREDSTORE(E′I′+[0,t),J′+[0,r)) ST.S rtE n
2m2E/rt
14 GLOBALLOAD(pJ+[0,r),J′+[0,r)) LD.G r2F n2m2F/t2
15 SHAREDSTORE(pJ+[0,r),J′+[0,r)) ST.S r2F n2m2F/t2
16 parfor i ∈ I + [0, t), i′ ∈ I ′ + [0, t) do
17 for j ∈ J + [0, r) do
18 SHAREDLOAD(Aij ,Eij) LD.S nt · mt · nr · mr · t · t · r E + F n2m2(E + F )/r
19 for j′ ∈ J ′ + [0, r) do
20 SHAREDLOAD(Ai′j′ ) LD.S nt · mt · nr · mr · t · t · r · r F n2m2F
21 SHAREDLOAD(Ei′j′ ) LD.S E n2m2E
22 SHAREDLOAD(pjj′ ) LD.S F n2m2F
23 aii′ ← aii′ + pjj′ ·Aij ·A′i′j′
24 · κe(Eij ,E′i′j′) OPS X n2m2X
25 GLOBALSTORE(aI+[0,t),I′+[0,t)) ST.G nt · mt t2F nmF
Register Blocking
Line Algorithm Category Loop count Unit cost Total cost
1 for I ∈ [0, t, 2t . . . n), I ′ ∈ [0, t, 2t, . . .m) do
2 aI+[0,t),I′+[0,t) ← 0
3 for J ∈ [0, r, 2r . . . n) do
4 GLOBALLOAD(AI+[0,t),J+[0,r)) LD.G nt · mt · nr rtF n2mF/t
5 GLOBALLOAD(EI+[0,t),J+[0,r)) LD.G rtE n2mE/t
6 for J ′ ∈ [0, r, 2r, . . .m) do
7 GLOBALLOAD(A′I′+[0,t),J′+[0,r)) LD.G
n
t · mt · nr · mr rtF n2m2F/rt
8 GLOBALLOAD(E′I′+[0,t),J′+[0,r)) LD.G rtE n
2m2E/rt
9 GLOBALLOAD(pJ+[0,r),J′+[0,r]) LD.G r2F n2m2F/t2
10 SHAREDSTORE(pJ+[0,r),J′+[0,r]) ST.S r2F n2m2F/t2
11 parfor i ∈ I + [0, t), i′ ∈ I ′ + [0, t) do
12 for j ∈ J + [0, r), j′ ∈ J ′ + [0, r) do
13 SHAREDLOAD(pjj′ ) LD.S n2m2 F n2m2F
14 aii′ ← aii′ + pjj′ ·Aij ·A′i′j′
15 · κe(Eij ,E′i′j′) OPS X n2m2X
16 GLOBALSTORE(aI+[0,t),I′+[0,t′)) ST.G nt · mt t2F nmF
Tiling-Blocking
Line Algorithm Category Loop count Unit cost Total cost
1 for I ∈ [0, t, 2t . . . n), I ′ ∈ [0, t, 2t, . . .m) do
2 parfor i ∈ I + [0, t), i′ ∈ I ′ + [0, t) do
3 aii′ ← 0
4 for J ∈ [0, t, 2t . . . n) do
5 GLOBALLOAD(AI+[0,t),J+[0,t)) LD.G nt · mt · nt t2F n2mF/t
6 SHAREDSTORE(AI+[0,t),J+[0,t)) ST.S t2F n2mF/t
7 GLOBALLOAD(EI+[0,t),J+[0,t)) LD.G t2E n2mE/t
8 SHAREDSTORE(EI+[0,t),J+[0,t)) ST.S t2E n2mE/t
9 for J ′ ∈ [0, t, 2t . . .m) do
10 GLOBALLOAD(A′I′+[0,t),J′+[0,t)) LD.G
n
t · mt · nt · mt t2F n2m2F/t2
11 SHAREDSTORE(A′I′+[0,t),J′+[0,t)) ST.S t2F n
2m2F/t2
12 GLOBALLOAD(E′I′+[0,t),J′+[0,t)) LD.G t2E n
2m2E/t2
13 SHAREDSTORE(E′I′+[0,t),J′+[0,t)) ST.S t2E n
2m2E/t2
14 GLOBALLOAD(pJ+[0,t),J′+[0,t)) LD.G t2F n2m2F/t2
15 parfor i ∈ I + [0, t), i′ ∈ I ′ + [0, t) do
16 for h ∈ [J, J + r, . . . J + t) do
17 SHAREDLOAD(Ai,h+[0,r)) LD.S nt · mt · nt · mt · t · t · tr rF n2m2F/t
18 SHAREDLOAD(Ei,h+[0,r)) LD.S rE n2m2E/t
19 for h′ ∈ [J ′, J ′ + r, . . . J ′ + t) do
20 SHAREDLOAD(A′i′,h′+[0,r)) LD.S nt · mt · nt · mt · t · t · tr · tr rF n2m2F/r
21 SHAREDLOAD(E′i′,h′+[0,r)) LD.S rE n
2m2E/r
22 for j ∈ h+ [0, r) do
23 for j′ ∈ h′ + [0, r) do
24 aii′ ← aii′ + pjj′ ·Aij ·A′i′j′
25 · κe(Eij ,E′i′j′) OPS n2m2 X n2m2X
26 GLOBALSTORE(aI+[0,t),I′+[0,t)) ST.G
n
t
· m
t t
2F nmF
