High performance DFS-based subgraph enumeration on GPUs by Dodeja, Vibhor
© 2021 Vibhor Dodeja





Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the






Subgraph enumeration is an important problem in the field of Graph Analyt-
ics with numerous applications. The problem is provably NP-complete and
requires sophisticated heuristics and highly efficient implementations to be
feasible on problem sizes of realistic scales. Parallel solutions have shown a
lot of promise on CPUs and distributed environments. GPU-based solutions
are gaining traction in recent times.
Subgraph enumeration involves traversing a search tree formulated to find
matches of a query in a graph. Most GPU-based solutions traverse the
tree in a breadth-first manner which exploits parallelism at the cost of high
memory requirement, which presents a formidable challenge for processing
large graphs since the memory capacity of GPUs is significantly lower than
that of CPUs. In this thesis, we propose a fast GPU-based solution based
on depth-first traversal of the search tree. The depth-first approach requires
less memory but presents more challenges for parallel execution.
We apply various optimizations to optimally utilize memory and compute
resources of the GPUs. We evaluate our performance in comparison with
state-of-the-art CPU and GPU implementations. We outperform them with
a geometric mean speedup of 10, 678× and 8.56× from CPU and GPU imple-
mentations respectively. We also show that the proposed approach can effi-
ciently process the graphs that previously cannot be processed by these state-
of-the-art implementations due to their excessive memory requirement.
ii
To my family, for their eternal love and support.
iii
ACKNOWLEDGMENTS
I wish to express my deepest gratitude to my advisers, Professor Rakesh
Nagi and Professor Wen-mei Hwu, for their guidance and support. They
have always been sources of wisdom and motivation. I am very grateful to
have found the opportunity to work under two amazing mentors.
I would like to extend my thanks to my colleagues at the IBM C3SR group
for all their help and support. I would especially like to thank Mohammad
AlMasri for his mentorship and support throughout the process and Vikram
Mailthody for introducing me to the opportunity and his support throughout
my master’s degree.
My gratitude is also extended to my previous advisor, Professor Shobha
Vasudevan, who taught me many important lessons and the University of
Illinois Urbana-Champaign for the opportunity to learn and grow in one of
the best universities in the world.
A special thanks to my parents and family for their unconditional love and
support. The amount of gratitude for them cannot be expressed in words.
I would also like to thank my friends, especially Anjana Suresh Kumar,
for all the fun and support in this foreign land.
iv
TABLE OF CONTENTS
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . vii
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . 3
2.1 Notations/Definitions . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Graph Representation . . . . . . . . . . . . . . . . . . . . . . 4
2.4 GPU Architecture Background . . . . . . . . . . . . . . . . . . 5
CHAPTER 3 RELATED WORK . . . . . . . . . . . . . . . . . . . . 8
3.1 CPU-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . 8
3.2 GPU-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER 4 PREPROCESSING . . . . . . . . . . . . . . . . . . . . 12
4.1 Query Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Symmetry Breaking . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Data Graph Preprocessing . . . . . . . . . . . . . . . . . . . . 16
CHAPTER 5 SEARCH TREE . . . . . . . . . . . . . . . . . . . . . 18
CHAPTER 6 BUILDING BLOCKS: K-CLIQUE COUNTING . . . . 23
6.1 K-Clique Counting versus Subgraph Enumeration . . . . . . . 23
6.2 Parallel DFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.3 Binary Encoding for Induced Subgraphs . . . . . . . . . . . . 25
6.4 Memory Management . . . . . . . . . . . . . . . . . . . . . . . 27
CHAPTER 7 GPU IMPLEMENTATION . . . . . . . . . . . . . . . 28
7.1 Iterative DFS Traversal . . . . . . . . . . . . . . . . . . . . . . 28
7.2 Central Node . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7.3 Thread Management . . . . . . . . . . . . . . . . . . . . . . . 31
7.4 Memory Management . . . . . . . . . . . . . . . . . . . . . . . 32
CHAPTER 8 PERFORMANCE EVALUATION . . . . . . . . . . . . 34
8.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . 34
8.2 Performance Results . . . . . . . . . . . . . . . . . . . . . . . 37
v
CHAPTER 9 CONCLUSION AND FUTURE WORK . . . . . . . . 49
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50




CPU Central Processing Unit
DFS Depth-First Search








Graphs are widely used to represent a variety of data owing to their high rep-
resentational power and efficiency in capturing complex relations within the
data. Consequently, they are used in numerous applications including but
not limited to social networks, the Internet, chemical engineering, molecu-
lar studies, protein functions, genetic engineering, computer-aided design,
cyber-security, network analysis and many more. One of the fundamental
operations at the core of graph analysis is subgraph enumeration. It aims to
enumerate all instances of subgraphs of a data graph that are isomorphic to
a given template.
Given a data graph GD and a template graph GT , the objective is to find
all subgraphs of GD isomorphic to GT . Subgraph enumeration in the most
general case is provably NP-complete [1]. Combined with ever increasing
sizes of graph data sizes, the problem is computationally challenging. With
the boom in performance of the modern hardware, and interest in smaller
templates, the problem has recently garnered interest for fast, practical effi-
cient implementations.
High performance implementations are often specialized for the hardware
they run on. Parallel solutions have been popular in multi-core CPU and
distributed clusters. However, recently there has been a growing interest
in GPU-based acceleration. GPUs are massively parallel devices based on
the Single-Instruction-Multi-Thread (SIMT) execution model. The differ-
ent execution model of GPU-based from that of the traditional CPU-based
approaches comes with its own challenges and optimizations resulting in con-
trasting approaches across the two domains.
Most CPU-based approaches for subgraph enumerations [2, 3, 4] are based
on the depth-first search (DFS) tree traversals as it is more memory efficient.
In general, DFS leads to irregular memory access patterns and involves back-
tracking leading to divergent execution paths across threads. Consequently,
1
most GPU-based approaches [5, 6, 7, 8] employ breadth-first search (BFS)
which counters the irregularity of DFS and better exploits parallelism at the
cost of exploding memory requirements.
This thesis demonstrates a DFS-based implementation on a GPU for the
subgraph enumeration problem. DFS on GPUs comes with its own chal-
lenges. We utilize various optimizations to counter these challenges includ-
ing hierarchical thread partitioning, better memory management techniques
like persistent memory, shared stack etc. We contrast our performance with
state-of-the-art CPU and GPU implementations and show a geometric mean
speedup of 10, 678× and 8.56× from CPU and GPU implementations respec-
tively.
The remainder of the thesis is organized as follows. Chapter 2 provides
background and introduces notations used in the thesis. Chapter 3 describes
the related prior work in this domain. Chapters 4 to 7 cover the various
aspects of the algorithm and the implementation. Chapter 8 evaluates the
performance of the implementation. Chapter 9 wraps up the thesis with




In this chapter, preliminary background is introduced including basic graph
notation, graph matching definition, and GPU architecture fundamentals.
2.1 Notations/Definitions
A graph G is defined as an ordered pair G = (V,E), where V is the set of
vertices or nodes and E ⊆ V ×V is the set of edges. For edge (u, v) ∈ E, the
vertex u is called the source and vertex v is called the destination. When the
set E is symmetric, that is ∀(u, v) ∈ E, (v, u) ∈ E, the graph is said to be
undirected. A path in a graph is defined as a sequence of edges which joins
a sequence of vertices. A graph is said to be connected if there exists a path
between all pairs of vertices in the graph. A graph is said to be simple if
there are no cycles or multiple edges between the same pair of vertices.
A subgraph g of graph G, denoted by g ⊆ G, is defined as a graph (Vg, Eg)
such that Vg ⊆ V and Eg ⊆ E. Given a set of vertices V ′ ⊆ V , the subgraph
of G induced by V ′, denoted by g(V ′), is defined as the subgraph g = (V ′, E ′)
such that E ′ = {e = (u, v)|e ∈ E;u, v ∈ V ′}.
For a vertex v ∈ V , N (v) is said to denote the neighborhood of v, defined
as N (v) = {u|(v, u) ∈ E}. d(v) = |N (v)| is said to be the degree of v.
n = |V | and m = |E| denote the number of nodes and edges in the graph
respectively. In the subsequent chapters, we refer to the query or template
graph as GT whereas the data graph is denoted as GD.
2.2 Graph Matching
Graph matching refers to the problem of finding “similarity” between two
graphs. It refers to finding a mapping between the nodes of two graphs
3
satisfying certain criteria. Different notions of similarity corresponding to
different criteria gives rise to a variety of flavors for the graph matching
problem. Broadly, they can be divided into categories, namely exact and
inexact matching. Exact matching is characterized by the criterion called
“edge-preservation”. Edge-preservation states that if two nodes are adjacent
in a graph, the corresponding mapped nodes must also be adjacent in the
other graph. Inexact matching, as the name suggests, does not consider the
edge-preserving criterion.
The strictest form of exact graph matching is called graph isomorphism.
Graph isomorphism constrains the node mapping to be bijective, that is there
is one-to-one correspondence between the nodes of the two graphs. Weaker
matching is obtained by relaxing the bijective condition. An injective node
mapping between the graphs is referred to as graph monomorphism, whereas
a surjective node mapping is referred to as graph epimorphism. When both
the injective and surjective conditions are dropped, we get graph homomor-
phism.
Weaker forms of exact graph matching can also be obtained by matching
subgraphs of one graph to the other graph. The most common form of such
“subgraph” matching is called subgraph isomorphism. Given two graphs GD
and GT , subgraph isomorphism refers to finding subgraphs of GD that are
isomorphic to GT . The graph GD is often referred to as the data graph, while
graph GT is often referred to as the query graph or the template graph.
2.3 Graph Representation
Various data structures are used to represent graphs in the computing world.
The two most elementary data structures are Adjacency List and Adjacency
Matrix. Adjacency List is an array of linked lists where the ith element of
the array stores a linked list of the neighbors of node i. Adjacency Matrix
(represented as A) is an n× n matrix where
Aij =
1 (i, j) ∈ E0 otherwise
.
Since the adjacency matrix is often sparse, it is usually represented in a
4
(a) Example graph (b) Compressed sparse row (CSR)
(c) Adjacency list (d) Adjacency matrix
Figure 2.1: Graph representations
sparse matrix format. Various sparse matrix formats exists such as Coor-
dinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column
(CSC), Jagged Diagonal Storage (JDS), etc. The most commonly used for-
mat in literature and the one we use is CSR. CSR uses two arrays – Column
index and Row Pointer. Column Index array stores the non-zero column
indices of each row sequentially. Row Pointer array stores a pointer or in-
dex into the column index array for each row. An additional data array is
optionally used to capture any edge labels or weights. Figure 2.1 shows the
different representations for an example graph.
2.4 GPU Architecture Background
2.4.1 Thread Organization
GPUs are high-performance devices designed to provide high memory and
computational throughput with hundreds of cores (grouped into what is
called Streaming Multi-processors or SMs) packed together in a single device.
GPUs work on an enormous number of threads in parallel. These threads
5
are grouped into “blocks” that are distributed over to SMs. Each “thread
block” is comprised of smaller groups of threads called “warps”. A typical
warp comprises of 32 threads on NVIDIA GPUs. In general, all threads in
a warp all execute the same instruction simultaneously in an SIMT (Single
Instruction, Multiple Threads) fashion. However, this may not hold with the
involvement of “Independent Thread Scheduling” as explained later.
2.4.2 Memory Organization
The logical GPU memory hierarchy consists of global memory, shared mem-
ory and registers. Global memory, also referred to as the device memory,
refers to the high-bandwidth memory connected to the GPU chip. It is
accessible from all threads executing on the GPU as well CUDA APIs to
transfer data between the host (CPU) and the device (GPU). Shared mem-
ory is shared across threads within a block. It is smaller and faster than
global memory but bigger and slower than registers. Each thread has ac-
cess to its local registers that are not shared with other threads. If a thread
requires more registers than are physically available, local memory is used
to store the additional registers which has the same access latency as global
memory. Consequently, register usage is crucial in GPU kernel programming.
The GPU memory organization is summarized in Fig. 2.2.
Figure 2.2: GPU memory organization
6
Another performance critical consideration is the memory access pattern.
When a thread warp executes a load instruction, multiple loads are issued
(one for each thread). Since the memory subsystem cannot process these
loads in parallel, these get serialized. Since warps execute simultaneously
the total time for the load increases by a factor of number of threads in
the warp (usually 32). However, when the memory accesses are on adjacent
memory locations, the loads can be combined into a single load fully utilizing
the DRAM burst. Such a memory access pattern where threads in a warp
access consecutive memory locations is said to be Coalesced and can have a
significant performance impact.
2.4.3 Block Scheduling and Persistent Memory Model
Kernels are usually launched with more thread blocks than that can be ex-
ecuted concurrently. Once a block is scheduled to run on an SM, it is never
preempted. That is, it is never scheduled out until completion. It provides
a scope for clever memory utilization as memory need not be allocated for
all blocks in the grid. Instead, memory can be allocated only for blocks that
can run concurrently. This concept is called Persistent Memory and can help
drastically reduce memory allocation, especially for large grid kernels.
2.4.4 Independent Thread Scheduling
NVIDIA GPUs introduced the concept on Independent Thread Scheduling
on Volta Architecture [9]. This allows threads within a wrap to synchro-
nize independently of other threads in the warp. This allows for fine-grain
parallelization within a warp which traditionally executed together in SIMD
(Single Instruction, Multiple Data) fashion. The fine-grain synchronization is
enabled through new warp level primitive called warpsync(mask). It takes
in a 32-bit mask to select threads that need to be synchronized. For example,





Subgraph isomorphism is an age-old problem with solutions dating back to
the 1970s. Ullmann in 1976 [2] provided a brute-force tree-search enumer-
ation algorithm which forms the basis of many present-day solutions. The
algorithm is equivalently a depth-first search algorithm. Each isomorphism
instance can be encoded in a binary matrix of size |VT |× |VD| such that each
row (for an element of VT ) contains exactly one 1 and each column (for an
element of VD) contains at most one 1. The algorithm recursively enumerates
all possible such matrices by starting with an initial matrix and pruning ones
recursively to satisfy the aforementioned conditions.
The algorithm has since seen significant improvements. Ullmann provided
an improvement to the algorithm in 2011 [10] based on bit-vector reductions.
VF2 (2004) [3] by Cordella et al. enhanced Ullmann’s algorithm with various
heuristics to improve memory usage. This was enhanced further by Carletti
et al. with VF3 [4, 11, 12] in 2017-18 to catch up with the ever-increasing
scale and diversity of data graphs.
There are alternative approaches other than traversing search trees avail-
able in the literature. Graph indexing based methods [13, 14, 15, 16, 17]
extract feature-based indices from the template graph. The indices are used
to search in a filtering-and-verification strategy. Similar indices are com-
puted in the data graph and is used to prune negative results (filtering). The
remaining candidate set from the data graph is then “verified” with respect
to the template graph. Some papers [18, 19, 20] model the subgraph isomor-
phism problem as a constraint satisfaction problem. The objective is to find
a mapping or assignment of vertices that satisfy the constraints formulated
by the query graph.
8
These algorithms, however, are designed for faster sequential execution.
The problem has huge potential for parallelism and many parallel algorithms
have been published over time. McCreesh et al. [20] proposed a parallel al-
gorithm for multi-core CPUs based on constraint satisfaction formulation of
the problem. Carletti et al. proposed VF3P [21] which is a parallel imple-
mentation of VF3-Light [12] on multi-core CPUs.
Multi-core CPUs are limited in the amount of parallelism they offer. To
exploit more parallelism, distributed clusters and GPU-based approaches
have been proposed. Owing to the differences in their architectures, the
approaches are quite dissimilar as well. While GPU-based approaches follow
a BFS-style tree search algorithm, distributed implementations commonly
employ relational database (RDB) style joins.
In RDB-style join approach, the query graph pattern vertices are treated
as attributes and the partial results as relational tables. The search algo-
rithm boils down to recursive joins of the tables with the attributes. The
focus of optimization in this approach is the search plan or the order in
which attributes are joined. TwigTwigJoin [22], SEED [23], StarJoin [24]
and GenericJoin [25] are a few recent join algorithms. Lai et al. [26] pro-
vides a comprehensive walk-through of the multiple join algorithms along
with analysis of their performance.
Some traversal approaches also exist in the distributed environment. PSgL
[27] and CECI [28] are a couple of BFS-traversal based approach. Ma et al.
[29] demonstrate a message passing based traversal approach on distributed
systems.
3.2 GPU-Based Algorithms
In recent times, interest has shifted toward GPU implementations owing to
the high data throughput and massive parallelism on a single device. The
difference in the parallelism model of GPUs compared to that of traditional
CPUs has caused a shift away from traditional backtracking algorithms. The
most common approach in GPU implementations has been BFS-traversal.
GpSM [5] is one of most popular state-of-the-art GPU implementation
of subgraph isomorphism. They follow a breadth-first filtering-and-joining
strategy. They match edges of data graph to the edges of the query graph
9
(filtering) and then joining them iteratively to generate partial results until
they reach the final solution. However, this requires storing of considerably
larger number of intermediate results leading to high memory requirement.
Consequently, their approach is limited to small graphs.
GSI [6] is a recent work based on breadth-first filter and join strategy. Their
major contribution is an efficient data structure called PCSR (Partitioned
Compressed Sparse Row) to represent edge-labeled graphs. While it helps
improve memory performance, it incurs a pre-processing cost. With PCSR
and other specific optimizations, the algorithm is extremely efficient for la-
beled subgraph isomorphism but is known to not scale well in the unlabeled
case.
GunRockSM [7] is another recent work based on a similar strategy. Their
approach focuses on the memory issues associated with storage of interme-
diate results. They bound the memory usage by statically allocating the
memory for intermediate results at the cost of inaccurate counts. Further-
more, the memory bound is quadratic in the data graph size (number of
nodes), causing the approach to be limited to fairly small graphs. The per-
formance they report is impressive on small graphs, however it cannot be
compared evenly to the other exact matching algorithms.
RPS [8] is another recent work that builds on GpSM and points out that
the bottleneck of performance is the set intersection operation. They observe
that a significant portion of these operations are redundant and proposes a
solution to improve the reuse of intersection results. Another recent work
from the same authors is PBE [30]. While the cornerstone of their contri-
butions is around subgraph enumeration on partitioned graphs and using
CPU-GPU heterogeneous computing, their implementation performs well on
single GPU for unpartitioned graphs too.
Other flavors of subgraph isomorphism problems exist such as motif count-
ing, frequent subgraph mining, SPARQL queries, etc. These utilize simi-
lar approaches to the ones mentioned above along with other specific opti-
mizations. A trivial simplification of the problem is to have algorithms to
search for certain important patterns such as triangles [31, 32, 33, 34], trusses
[34, 35, 36], and cliques [37]. The Graph Challenge [38] has been a key driving
force in this domain. Various innovative optimizations have been proposed
for solving these specific problems including graph orientation, peeling, fast
set intersection, etc.
10
Taking lessons from the above mentioned approaches, we propose a novel
DFS-based subgraph enumeration implementation in this thesis. Although
DFS on a single tree is hard to parallelize, DFS can be executed concurrently
on multiple sub-trees. Moreover, DFS solves the exploding memory require-
ments of BFS by not having to store all intermediate results simultaneously.
We utilize various other techniques and optimizations from across domains




Before actually building and traversing the search tree, the algorithm takes
in the template graph and processes it to extract useful information to help
remove redundancy and prune the search tree. Owing to the small size of
the template graphs, template processing can be performed on the CPU
with almost no overhead. Apart from processing the template graph, the
data graph is also pre-processed on the GPU to prune out nodes and edges
that are guaranteed to not contribute to the result. The pre-processing step
is kept simple and lightweight. This chapter provides details of all the pre-
processing that happens before the search tree is formulated.
4.1 Query Sequence
As a first step, a node ordering of the template graph is generated. This
node ordering also forms the sequence in which template nodes are matched
to the data graph. The ordering is chosen in a way to prioritize nodes which
are more likely to find fewer potential matches to constrain the size of the
search tree.
The algorithm explained in Algorithm 1 is inspired from Maximum Like-
lihood estimation proposed in VF3 [4]. They consider the following three
factors in decreasing order of priority to compute the node sequence, ST .
• Node mapping degree, dM . The node mapping degree of a node u is
defined as the number of edges of u with nodes already in the sequence
ST . That is,
dM(u) = |(u, v) : v ∈ ST |.
The idea is to constrain the number of candidates by limiting it to
intersection of adjacency of already matched nodes. It also ensures
12
that the match is an induced subgraph of the first node limiting the
search space. Template nodes with higher dM value appear earlier in
the matching sequence of the template graph.
• Match probability, Pf . The probability of finding a match for a
node u in the data graph GD is obtained by combining the probability
of finding a node in GD with a label l (Pl(l)) and the probability of
finding a node with degree d (Pd(d)). They compute Pf as follows,





where λVT (u) : VT −→ LT is the labeling function. Template nodes
with smaller Pf values appear earlier in the matching sequence of the
template graph.
• Degree. If multiple nodes end up having same dM and Pf , the tie is
broken by the degree of the node. Nodes with higher degrees appear
earlier in the matching sequence of the template graph. If they have
the same degree, the tie is broken randomly.






Without loss of generality, consider two template nodes, u and u′ such that
d(u) < d(u′). On an unlabeled graph, Pf (u
′) < Pf (u). Hence, priority
based on match probability essentially boils down to priority based on degree.
Moreover, match probability is dependent on the data graph as well which can
be computationally very expensive owing to large sizes of the data graphs. We
adopt VF3’s algorithm for our node query sequencing, but we do not consider
the match probability, Pf for the reasons mentioned above. We present a
step-by-step walkthrough of Algorithm 1 with the help of an example in
Appendix A.
13
Algorithm 1: Template graph preprocessing
Input: Template Graph, GT
Output: Node Query Sequence, ST
1 dM [nT ]← 0;
2 ST ← φ;
3 for i← 0 to nT do
// Compute likelihood for all nodes not in ST
4 for j← 0 to nT do
5 if j not in ST then
6 likelihood[j]← dM [j] · nT + degree[j];
7 end
8 end
// Find node with maximum likelihood and it to ST
9 idx ← argmax(likelihood);
10 append idx to ST ;
// Update dM
11 foreach neighbor j of node idx do




Given a query sequence ST = {u1, u2, . . . , unT }, a match in the data graph
can be represented as the sequence {v1, v2, . . . , vnT } such that vi ∈ VD is
matched with ui ∈ VT . Consider an example data and template graph
shown in Fig. 4.1. Consider the template node ordering sequence as obtained
by Algorithm 1, {T0, T1, T2, T3}. One would expect to find two matches in
the given data graph, namely {D0, D1, D2, D3} and {D0, D2, D1, D4}. How-
ever, we would also find the following matches as well {D0, D1, D3, D2},
{D1, D0, D2, D3}, {D1, D0, D3, D2}, {D0, D2, D4, D1}, {D2, D0, D1, D4} and
{D2, D0, D4, D1}. They are distinct in terms of the sequence, but clearly rep-
resent the same structure. This is owing to the symmetry, or Automorphism
of the template graph.
An automorphism of a graph is defined as an isomorphism of the graph
to itself. More formally, an automorphism of a graph G = (V,E) refers
to a permutation σ of the vertex V such that a vertex pair (u, v) for an
edge if and only if (σ(u), σ(v)) also form an edge. The composition of two
automorphisms gives another automorphism. The set of automorphisms of
14
Figure 4.1: Example template graph (left) and data graph (right)
a given graph forms a group called the “automorphism group” of the graph.
The equivalence class of vertices of the graph G under this group is called
“orbit”.
Algorithm 2: Symmetry breaking
Input: Template Graph, GT
Output: Set of partial node orderings, PT
1 A ← GetAutomorphismGroup(GT);
2 while A 6= {I} do
3 OA ← GetOrbits(A);
4 Pick largest orbit {v1, v2, . . . , vk} from OA;
5 Add v1 < v2, v1 < v3, . . . , v1 < vk to PT ;
6 A ← {α|α ∈ A, α(v1) = v1}
7 end
As demonstrated in the example above, the consequence of the automor-
phism of the template graph is redundant counts. To counter the redun-
dancy, we apply a symmetry breaking technique similar to that proposed by
Grochow et al. [39]. The idea is to assign a partial order to the nodes of
the template graph such that there is no other permutation σ that satisfies
the partial order and is isomorphic to itself. The algorithm is outlined in
Algorithm 2. For our implementation, we use a framework called NAUTY
[40] which provides the functions equivalent to GetAutomorphismGroup and
GetOrbits.
For the template graph shown in Fig. 4.1, the automorphism group of the
graph is as follows: {I, (T0, T1), (T2, T3)}. We get two orbits for this simple
group, namely {(T0, T1), (T2, T3)}. We pick the first orbit and assign the
partial order T0 < T1. After reducing the group as in Algorithm 2, line 6, we
get the new group as {I, (T2, T3)} with orbit {(T2, T3)}. We now assign the
15
partial order T2 < T3.
To eliminate the duplicate matches, an ordering of the data graph is as-
sumed. The partial order of the template graph is then enforced on the
matches as well. To elaborate with the example, consider a simple lexico-
graphical ordering of the data graph such that Di < Dj, ∀i < J . Hence,
of all matches for the example listed above, the ones fitting the pattern
{D1, D0, , } or {D2, D0, , } can be eliminated by enforcing the partial or-
der T0 < T1. Similarly, all matched fitting { , , D3, D2} or { , , D4, D1} can
be eliminated by enforcing the partial order T2 < T3. This leaves us with
only the two matches we expected to find in the first place.
A key point to note is that different ordering of data graphs can be used
for breaking different pairs of symmetrical templates. In the above case,
for example, we could use Di < Dj, ∀i < j to break T0 < T1 and use
something else, say Di < Dj, ∀i, j : degree(i) < degree(j) to enforce the
partial ordering T2 < T3. This observation will be used in Chapter 5 to help
prune the search tree.
4.3 Data Graph Preprocessing
We apply a basic peeling on the data graph to peel away nodes which are
guaranteed to not contribute to any matches as explained in Algorithm 3.
This helps in reducing the size of the graph as well as its maximum degree.
Reduction in graph size and maximum degree helps reduce the size of the
search tree as well as the size of the adjacency lists over which intersections
are performed. Given the minimum degree among the nodes of the template
graph, say dT,min, all nodes of the data graph with degree d < dT,min cannot
be matched to any node in the template graph and hence can be removed.
After removal of these nodes, and updating the degrees of the graph, we
can perform this procedure again on the updated graph. This procedure
can be repeated for multiple iterations. The stopping condition we use is
heuristically chosen to be when the size of NodeFilterSet falls below a
certain threshold to balance between the amount of pruning and time taken.
We choose the threshold value as 5% of the size of the graph.
16
Algorithm 3: Peeling data graph
Input: Data graph, GD = (VD, ED), Minimum Degree of template
graph, dT,min
1 repeat
2 NodeFilterSet ← {u|u ∈ VD, Degree(u) < dT,min};
3 EdgeFilterSet ← {(u, v)|u ∈ NodeFilterSet or v ∈ NodeFilterSet};
4 VD ← VD \ NodeFilterSet;
5 ED ← ED \ EdgeFilterSet;




In this chapter, we walk through an example to get a better understanding of
what the search tree looks like and how it is constructed and pruned. The ex-
ample data and template graphs considered are shown in Fig. 5.1. Although
we demonstrate the construction of the tree in a breadth-first method, it is
important to note that it is traversed in depth-first fashion.
(a) Template graph (b) Data graph (c) “Peeled” data graph
Figure 5.1: Example graphs to demonstrate search tree building process
As explained in Chapter 4, before the search tree is constructed, both
template and data graph are pre-processed. The pre-processed (or “peeled”)
data graph is shown in Fig. 5.1(c). The query node sequence obtained from
Algorithm 1 on this template is ST = {T0, T1, T2, T3} and the set of partial
ordering to break symmetry of the template are {T0 < T1, T2 < T3}.
The nodes at level i of the tree is considered a potential match to template
node ST [i]. The root of the tree is considered to be at level one. Consequently,
the final matches are obtained by paths between root and the deepest leaves.
At the start of the process, all nodes in the “peeled” data graph are candi-
dates for a match for ST [1] = T0. We know that degree(T0) = 3, consequently
18
all data nodes with degree less than 3 can be eliminated. We call this elim-
ination process, “Degree filtering”. After degree filtering, we are left with
the following candidates for match with T0: {D0, D1, D2, D3, D4, D5}. These
nodes form the roots of different search trees.
For brevity, let us consider a couple of search trees for the rest of this
example, say the ones rooted at D1 and D3. The search tree construction for
these trees is demonstrated in Fig. 5.2. For all following levels, the candidate
nodes are decided by intersecting adjacency lists of “backward neighbors”
for that level. The “Backward Neighbor” of a template node ST [i] is given
by BN [i] = {j|(ST [j], ST [i]) ∈ ET}. At level two, the backward neighbor of
ST [2] = T1 is only T0. Hence, candidates at level two of the tree are simply
the adjacency lists of node at level one as shown in Fig. 5.2(b).
Not all of these candidates will contribute to matches and can be pruned
from the search tree. Since level two matches with template node ST [2] = T1
that has degree three, nodes with degree less than three can be pruned.
Degree-filtered nodes are shown with “yellow” crosses in Fig. 5.2(c). An-
other pruning strategy as discussed in Chapter 4 is symmetry breaking. For
demonstration purposes, we use lexicographical ordering (Di < Dj, ∀i < j).
However, we use degree ordering in our implementation which is explained in
Chapter 7. Nodes pruned due to symmetry breaking are shown with “blue”
crosses in Fig. 5.2(c).
Similarly, we find candidates for level three by intersecting adjacency lists
of the backward neighbors of ST [3] = T2. The backward neighbors of T2 are
ST [1] = T0 and ST [2] = T1. Hence, level three candidates are obtained by
the intersection of nodes at level one and two in that branch as shown in
Fig. 5.2(d), we can perform pruning similar to as explained above. However,
it is noted that no nodes are pruned at this level in the example we consider.
Finally, we intersect backward neighbors of ST [4] = T3, BN [4] = {ST [1] =
T0, ST [2] = T1} to obtain candidates for level 4 as shown in Fig. 5.2(e). A
key observation is that the intersection is the same as that at level three.
Consequently, it can be observed that within the same branch of the tree,
the same node appears twice, meaning that a single node matches to two
nodes in the template graph. This contradicts the definition of isomorphism
and is deemed as an invalid match. Such matches are eliminated as shown
with “red” crosses in Fig. 5.2(f). It can be noted that degree filtering does not
prune any nodes at this level in this example. Recall that symmetry breaking
19
Figure 5.2: Demonstrating search tree construction
20
for different partial orders can use different ordering of the data nodes. We
use a different ordering scheme for levels that do not involve the root node.
The ordering is obtained by which node is visited first in the tree. To break
the symmetry between T2 < T3, for example, if a node being matched in
a branch at level 3 and it has been visited at level 2 prior to the sub-tree
its being matched, it is being excluded. We mark the node excluded due to
symmetry breaking with “purple” crosses in Fig. 5.2(f). The symmetrical
location is pointed with “purple” arrows.
The final search trees for all nodes in shown in Fig. 5.3. As mentioned
before, the final solutions are obtained by paths from root to the deepest
leaves. The final matches for this example as denoted by such paths in Fig.
5.3 correspond to the matches {{D0, D3, D1, D2}, {D1, D3, D0, D4}}.
21
(a) Rooted at D0 (b) Rooted at D2
(c) Rooted at D1 (d) Rooted at D4
(e) Rooted at D3 (f) Rooted at D5





As demonstrated in Chapter 3, most approaches for subgraph enumeration
follow a breadth-first search strategy. A notable exception to this is [37] that
employs DFS traversal to enumerate K-Cliques in the data graph. The au-
thors of [37] explain the parallelization strategies used to efficiently parallelize
a seemingly serial computation. They enlist various other optimizations such
as induced subgraph binary encoding, graph orientation, pivoting, memory
management, etc., some of which can be applied to subgraph enumeration
as well. We leverage this work as the foundation and generalize it to target
a general subgraph instead of cliques. This chapter summarizes their work
and the optimizations that we incorporate in our work.
6.1 K-Clique Counting versus Subgraph Enumeration
A K-Clique is a complete subgraph of k vertices and k(k − 1) edges such
that every vertex is connected to every other vertex. Owing to the fully
connected structure and all-symmetric nature of cliques, many optimizations
can be applied to extensively prune the search tree. Not only this, the next
level for the clique search tree can be obtained by only a single intersection.
However, not all subgraphs exhibit such symmetrical structures and a general
approach often needs to visit all the nodes in the search tree which maybe
order of magnitudes higher in count.
Furthermore, a major contribution to the performance of k-clique count-
ing is called “graph orientation”. Graph orientation refers to converting an
undirected graph into a directed one by “orienting” the edges in a particu-
lar direction. This greatly helps in reducing the out-degree of the vertices.
This reduces the size of adjacency lists and helps improve performance of
intersections and the size of the search tree. The significant improvement
23
in performance and memory utilization due to orientation is only possible
because of the inherent structure of cliques and cannot be generalized to
any other template. Consequently, subgraph enumeration is a much harder
problem to solve efficiently than that of k-clique counting.
6.2 Parallel DFS
To the best of our knowledge, [37] is the first paper to employ a DFS-based
search tree traversal on GPUs. However, as is generally agreed, DFS is
inherently a sequential traversal and cannot be easily parallelized [41]. The
authors do not challenge the claim and exploit the fact that DFS of each
subtree is an independent operation and can be performed in parallel. The
core idea is to give a group of threads one subtree to traverse. They use
shared memory space to keep track of the traversal. All threads in the group
traversing the same subtree might seem redundant, but having a group of
threads helps provide fine-grain parallelism to perform intersections.
Almasri et al. [37] follow a hierarchical thread partitioning pertaining to
the thread organization to divide the trees and sub-trees across different
blocks and thread groups. They experiment with vertex-centric and edge-
centric parallelization strategies. In vertex-centric strategy, the first level or
the root of each tree is handed over to a different thread block. In the edge-
centric strategy, on the other hand, each subtree rooted at the second level
of each tree is handed over to a different thread block. Within the thread
blocks, different thread groups work on the subtrees of the next level. Due to
independent thread scheduling as explained in Section 2.4.4, they can choose
size of the thread group to be different from the warp size. They experi-
ment with different group sizes as well. The different thread partitioning
approaches are summarized on an example tree (from Chapter 5) in Fig. 6.1.
They note that vertex-centric partitioning works better on small values of
k (or smaller templates) while edge-centric performs better as k increases.
This is expected as the vertex-centric partitioning amortizes the cost of ex-
tracting the induced subgraph, while edge-centric partitioning extracts more
parallelism and is more robust to load imbalance. Since the search tree is
shallower for smaller templates, vertex-centric approach is more suitable. As
size of the template graph grows, so does the load imbalance. This makes
24
Figure 6.1: Thread partitioning strategies
edge-centric approach more suitable for larger template graphs.
Apart from the partitioning of threads to independently traverse different
sub-trees, they also address the issues with recursive implementation of the
DFS algorithm. Given the huge number of threads on the GPU, each thread
gets a fairly small stack space. Consequently, the depth of recursion each
thread can go to gets extremely limited. Moreover, since a group of threads
traverse the same sub-tree, maintenance of state across these threads is an-
other issue. For these reasons, they implement DFS in an iterative fashion.
6.3 Binary Encoding for Induced Subgraphs
Adjacency list intersection occupies a significant proportion of the total com-
pute time and is often the governing factor in the implementation’s perfor-
mance [8]. A common optimization is to shrink the size of the adjacency
lists being intersected for each search tree by removing the vertices from the
25
graph that will never be reached in the search tree. Since K-cliques are com-
plete graphs, all nodes that are not the neighbor of the root node, will never
be reached in the search tree. In other words, all matches will be within
the subgraph induced by the root node. Consequently, the adjacency lists
used for intersection can be reduced to only nodes present in the induced
subgraph. The induced subgraph is usually much smaller than the entire
graph with smaller adjacency lists leading to faster intersections.
Almasri et al. [37] further optimize the computation by computing the set
intersections exactly once for each tree to compute the induced subgraph.
They encode the induced subgraph into a bit vector. This reduces all future
intersections into bitwise AND operations, which are significantly faster than
set intersections. It also helps reduce the memory required to store inter-
mediate results as each node is now represented by a bit instead of a 32-bit
integer.
Figure 6.2 provides an example to demonstrate how binary encoding is
generated. Consider the graph shown in Fig. 6.2(a). D is assumed to be
the root node, and we encode with respect to its adjacency. The subgraph
induced by node D is shown in Fig. 6.2(b). To compute the encoding of
adjacency of node A, for example, the adjacency of that node is formulated
with respect to the induced subgraph, as reported in Column 3 of Fig. 6.2(c).
This list is then encoded in a binary vector, where bit i is set to 1 if the ith
node in the reference list (adjacency of D in this case) is present in the
obtained adjacency and 0 otherwise. The last column of Fig. 6.2(c) shows
the computed binary encoding of various nodes assuming the ordering of the
reference list as {A,B,C,E,G}.
The above explanation assumes the vertex-centric approach of their im-
plementation. However, the idea can be easily generalized to encode with
respect to the list of candidate nodes that can be found in the sub-tree. For
example, in the edge-centric approach, the nodes that can be reached in sub-
tree corresponding to the root edge are found by the intersection of adjacency
lists of the ends of the root edge. This is what the authors use to encode in
their edge-centric implementation.
26
(a) Sample data graph (b) Subgraph induced by
root node
(c) Adjancency lists and encoding
Figure 6.2: Binary Encoding Example
6.4 Memory Management
Memory management is critical for any GPU-based application due to high
bandwidth requirement by hundreds of threads running concurrently. Fur-
thermore, in comparison to CPUs, GPUs have limited memory size. Hence,
the authors of [37] spend a lot of their focus on efficient memory management
in their implementation. Many facets of their implementation although not
directed toward memory, contribute to better utilization such as extracting
induced subgraph only once per tree, binary encoding, sharing state space
among threads for DFS, etc. Over and above this, they exploit persistent
memory as explained in Section 2.4.3 to allocate memory only for blocks
that can run concurrently and not for all blocks. This reduces memory by




In this chapter, we bring together the components introduced in the previous
chapters with finer implementation details. We explain the tree traversal
algorithm in Section 7.1. We introduce some constraints on template graphs
to maximize performance as elaborated in Section 7.2. The rest of the chapter
discusses how memory and compute resources are distributed to execute the
algorithm in parallel.
7.1 Iterative DFS Traversal
Traditional recursive DFS traversals do not scale well on GPUs due to the
limited stack space. Consequently, we flatten the recursion into an iterative
traversal. Algorithm 4 summarizes the iterative DFS algorithm that is used
to traverse a given sub-tree and returns the count of matches found in that
sub-tree.
The state of the search tree is maintained with the help of following vari-
ables:
• `: The current level of the tree
• matches: An array of arrays to store the matched vertices at each level
• num matches: An array to keep length of matches[`] for each level
• curr idx: An array to keep an index into the matches array for each
level; this index points to the current node being processed at that level
The algorithm takes in the initial state of the search tree at the initial
level, `init, along with the template graph and its pre-processed information
(node query sequence and backward neighbors). It also takes in the induced
subgraph Gind, which is essentially the subgraph induced by the vertices
28
Algorithm 4: Iterative DFS traversal
Input: GT = (VT , ET ), ST , Subgraph induced by the root Gind, set
of nodes in Vind with lower ordering than the root Vorient,
Initial level `init, initial set of vertices at that level
matchesinit, size of this set num matchinit
Output: Number of matches found in this sub tree, Count
1 `← `init, matches[`]← matchesinit, num match[`]← num matchinit;
2 curr idx[`]← 0, Count← 0;
3 while curr idx[`] < num match[`] do
4 matches[`+ 1]← Vind \matches[`][curr idx[`]];
5 for k ∈ backNeighbors(ST [`+ 1]) do
6 u← matches[k][curr idx[k]];
7 matches[`+ 1]← matches[`+ 1] ∩ (Adj(u) wrt Gind);
8 end
9 for k ∈ SymmetricalLevels(`+ 1) do
10 if k > 0 then
11 matches[`+ 1]← matches[`+ 1] \matches[k][0 : curr idx[k]];
12 end
13 else
14 matches[`+ 1]← matches[`+ 1] \ Vorient;
15 end
16 end
17 num match[`+ 1] = |matches[`+ 1]|;
18 if num match[`+ 1] > 0 and `+ 1 < k then
19 `← `+ 1 ; /* Go to child */
20 Vind ← Vind \matches[`][curr idx[`]];
21 curr idx[`]← 0;
22 end
23 else
24 if l + 1 == k then
25 Count← Count + num match[`+ 1];
26 end
27 curr idx[`]← curr idx[`] + 1 ; /* Go to sibling */
28 while curr idx[`] == num match[`] and ` > `init do
29 `← `− 1 ; /* Go to parent */
30 Vind ← Vind + matches[`][curr idx[`]];





that are reachable in the tree is being traversed. For the “central node” case
(explained in Section 7.2), Vind essentially reduces to the adjacency of the
root node. As explained in Chapter 5, we use different ordering for symmetry
breaking with root and other nodes. The ordering for symmetry breaking
with root vertex is provided as an input to Algorithm 4. Since the search tree
is limited to nodes in Gind, the ordering is also only required with respect to
nodes in Gind. All nodes in Vind that have lower ordering than the root node
are provided in the list called Vorient. In our implementation, we say Di < Dj
if degree(Di) > degree(Dj) or when they have same degree, if i < j. This is
called degree ordering.
First, the routine takes in the parameters and initializes the state of the
traversal (lines 1-2). The next step is to loop until there are no more nodes
to process (line 3). The loop starts by initializing the potential candidates
for the next level to the set Vind minus the nodes that have already been
visited (line 4). The candidates are then pruned by intersecting with the
adjacency lists of the back-neighbors of the node being matched (lines 5-8).
Once potential candidates have been obtained, we remove redundancies by
breaking symmetry as explained in Chapter 5 (lines 9-16). At this stage,
we have the matches we need for the next level. If the count of matches is
non-zero at a non-bottom level, we increment ` to traverse the children (lines
18-22). We also remove the current node from the set Vind since it cannot be
matched by its children. If the bottom level is reached, we add the number
of matches to the final count (lines 24-26). We increment the index for the
level to move to the sibling of the current node (line 27). If we do not have
a sibling to visit, we move back to the parent, add parent back to Vind and
move to parent’s sibling (lines 28-31).
7.2 Central Node
The search space in the subgraph isomorphism problem is essentially un-
bounded for a general template. This makes the problem extremely challeng-
ing. However, most practical applications of subgraph isomorphism focus on
dense templates. Similar to [37], we note that binding the search space can
provide immense performance improvements, especially on the GPUs due to
enhanced memory utilization.
30
Hence, we focus our performance optimization on the template graphs to
those containing at least one “Central node”. A central node is defined as
a node in the template graph that is connected to to all other nodes. More
formally, the set of central nodes is given by {u|u ∈ VT , degree(u) = |VT |−1}.
Figure 8.1 shows the various template graphs we use in our evaluation and
marks the central nodes for each of them.
It is noted that the algorithm explained in the previous chapters works
for a general template and this constraint is introduced solely for perfor-
mance improvements in the implementation. One can implement the same
algorithm for general graphs at the cost of some performance.
Under the assumption of central node, the search space for each match
is limited to the adjacency of the data node matched to the central node.
Algorithm 1 guarantees that if a central node exists in the template, it would
be the first node chosen. Consequently, for all template graphs with a central
node, the search space is bounded to the adjacency of the root node.
We exploit this observation to utilize the binary encoding as discussed
in Section 6.3. With binary encoding, many of the variables in Algorithm
4 turn into bit-vectors instead of a list of integers. These include Vorient,
matches and Adj(u) wrt Gind. Consequently, all set operations throughout
the algorithm (Lines 4, 7, 11, 14, 20, 30) reduce to bitwise operations. Bit-
wise operations are significantly faster than set operations. Along with the
memory efficiency of bit-vectors, binary encoding results in huge performance
boosts.
7.3 Thread Management
Algorithm 4 only forms a core of the GPU kernel. At the start of the ker-
nel, we compute the bit-vectors corresponding to Vind and Vorient. It then
distributes the second level to groups of threads within a thread block which
independently computes the third level which is then provided to Algorithm
4. While the DFS-state (lines 18-32) are handled by a single thread, set
operations (lines 4-17) are distributed across these group of threads to be
performed in parallel.
Under the central node assumption, set operations boil down to bitwise
operations. For an adjacency of size d, the size of the encoding is of size
31
d/32 (assuming storage in 32-bit integers). Hence, the set operations can
be performed in parallel by d/32 threads. For an adjacency of size 1024
and greater, warp-size (32 threads) efficiently parallelizes the computation.
However, for smaller adjacency lists, many threads go idle. Hence, we select
the partition size based on the max degree of the graph. For graphs with
low max degree, we use smaller partition size of eight threads, whereas for
graphs with large max degree, we use the entire warp.
Parallelism is exploited in a two-step hierarchy in this system. At the top
level is parallelization across blocks, which processes different trees in parallel.
Within each block, different sub-trees of the tree are processed in parallel by
partitioning the threads of the blocks into groups, each of which processes a
sub-tree. As we increase the block size, we increase parallelization within a
block but we reduce the number of blocks that can run concurrently, thereby
reducing parallelization across blocks. For graphs with higher degree, we have
fatter trees with large intersections requiring larger and more thread groups
per block. On the contrary, for graphs with lower degree, we have thinner
trees with small intersections requiring smaller and fewer thread groups per
block. Hence, it makes sense to use larger block sizes (e.g., 1024 threads) for
graphs with higher max degree (e.g., >20k) and smaller block sizes (e.g., 128
threads) for graphs with lower max degree (e.g., <1k).
We also experiment with different parallelization strategies similar to [37].
In the first strategy, we allocate a node to a block, that is each node is
matched to level one of the search tree. The sub-trees at level two are pro-
cessed by thread groups within the block. In the second strategy, we allocate
an edge to a block, that is the nodes at the end of each edge are matched to
levels one and two of the search tree. The sub-trees at level three are then
partitioned across thread groups within the block to execute in parallel.
7.4 Memory Management
Table 7.1 summarizes the memory requirements of the algorithm along with
where the memory is allocated. Nc denotes the number of blocks that can
run concurrently on the GPU, dmax denotes the maximum degree of the data
graph, nT is the number of nodes in the template graph, and P is the number
of thread groups or partitions in the block.
32
Table 7.1: Memory allocation
Data structure Memory hierarchy Memory size
Binary-encoded
adjacency (Gind)
Global Nc × d2max/32
Vorient Global Nc × dmax/32
matches Global Nc × nT × (dmax/32)× P
num match Shared per block nT × P
curr idx Shared per block nT × P
Apart from storing the data graph, the major component of global memory
allocation is the binary-encoded induced subgraphs. Unlike other implemen-
tations (e.g., [7]), where the memory usage depends on the graph size, our
memory allocation depends on the maximum degree of the data graph. Since
real application data graphs are usually very sparse (see Table 8.1), we are




In this chapter, we evaluate the performance of our implementation in com-
parison to the state-of-the-art CPU and GPU implementations in the litera-
ture. We run their implementations on our system to negate speedups pro-
vided by recent hardware. Section 8.1 elaborates on the system and setup
we use to evaluate the performance. We keep this setup common across all
experiments (unless otherwise specified) and report the results in Section 8.2.
8.1 Experiment Setup
8.1.1 Platform
We evaluate our GPU implementations on a dual socket machine with 512 GB
of main memory. Each socket has a 20-core (with two-way hyperthreading)
Intel Xeon Gold 6230 CPU and two V100 GPUs with 32 GB and 16 GB of
HBM2 global memory. The four GPUs are connected through NVLink 2.0.
We only use a single 32GB V100 GPU in our evaluation. We compile our
code with NVCC (CUDA 10.2) and GCC 8.3.1 with the -O3 flag. The CUDA
driver version used is 440.82. For the pre-porcessing filter and exclusive scan
operations, we use NVIDIA CUB 1.8.0 [42]. We use the same platform to
execute the baseline implementations for comparison.
8.1.2 Dataset
We use real world undirected data graphs from the Stanford Network Anal-
ysis Project (SNAP)[43] which are popular benchmarks in the subgraph iso-
morphism domain. The graphs and their characteristics are summarized in
Table 8.1.
34
Table 8.1: Set of data graphs used in experiments
Graph |V | |E| Max degree Size (in MB)
com-dblp 317,080 1,049,866 343 14
cit-patents 3,774,768 16,518,948 793 530
com-lj 3,997,962 34,681,189 14,815 479
com-orkut 3,072,441 117,185,083 33,313 1,688
com-frienster 65,608,366 1,806,067,135 5,214 30,866
8.1.3 Template Queries
We use common template graphs used for evaluation of subgraph isomor-
phism in the literature. Figure 8.1 lists the template graphs queried on the
data graphs listed above.
Figure 8.1: Template graphs used in experiments
35
8.1.4 Baselines
We compare our code to the state-of-the-art implementations on both CPU
and GPUs. We execute these implementations on our platform for a fair
comparison. We use VF3 [4] as our reference for CPU implementations. VF3
does not break symmetry and reports redundant counts. The counts can be
matched by dividing the reported counts by the amount of redundancy in
the template. For GPU baselines, we use the following:
• PBE [30]. Although PBE is designed for partitioned graphs and het-
erogeneous CPU-GPU computing, they perform well on unpartitioned
graph on a single GPU. They follow a “Compute-Materialize” flow
where in at each level they “compute” the intersections to find poten-
tial candidates at that level and “materialize” them into the search
tree. They provide optimizations towards counting by not material-
izing all levels. For enumeration, however, materializing is required.
Hence, we disable counting optimization while comparing enumeration
performance.
• RPS [8]. RPS uses the same code-base as PBE and is run on un-
partitioned graphs on a single GPU. They follow the same “Compute-
Materialize” flow as PBE but differ in their search plan. While PBE
follows a similar search ordering as ours, RPS tries to reuse the inter-
sections computed at a level for its children. They do not utilize the
counting optimizations as mentioned above.
Both PBE and RPS support the same native set of templates. They allow
support for more templates to be added but use some hard-coding on the
traversal plan. For templates not supported by them, we add our own traver-
sal plan to the best of our knowledge and understanding of their approach.
8.1.5 Metrics
For our implementation, the execution times reported include the pre-proce-
ssing time and enumeration/counting times. Unless otherwise specified, we
report the results for faster of the two parallelization strategies explained in
Section 7.3. We do not consider the time taken to read the graph. We also
36
report the split between pre-processing and enumeration. The same is ap-
plicable for times reported for baselines, except VF3 which does not provide
a split between time taken for pre-processing and enumeration. Similar to
RPS [8], we do not report times greater than 3 hours.
8.2 Performance Results
8.2.1 Comparison to State-of-the-Art CPU Implementation
We evaluate the performance of our implementation with respect to the
state-of-the-art CPU implementation (VF3 [4]). Given the limited par-
allelism available to be exploited in CPUs, most large graphs and tem-
plates timeout. Figure 8.2 shows the speedup of our implementation for
the template-data graph pairs which complete within the 3 hours time limit
on the CPU. We significantly outperform the CPU implementation with a
geometric mean speedup of 10, 678× demonstrating the capabilities of the
GPU for accelerating subgraph enumeration.
Figure 8.2: Speedup compared to state-of-the-art CPU implementation
37
8.2.2 Comparison to GPU Baselines
Figure 8.7 compares the performance of our implementation with state-of-
the-art GPU implementations for the data graphs and template graphs listed
in Section 8.1. The missing data points in Fig. 8.7 refer to the runs that
exceed the 3 hr time limit. Figure 8.7 also shows the speedup in our im-
plementation compared to the fastest baseline in that run. We outperform
the GPU implementations with a geometric mean speedup of 8.56× across
all templates and data graphs. The execution times along with the split be-
tween pre-processing and running time and the corresponding speedups are
tabulated in Table 8.3.
8.2.3 Scaling with Template Size
In this experiment, we fix the data graph to com-dblp, since it is small
enough to have a feasible count for large template graphs. We take two
classes of template graphs, namely cliques and wheels. We choose them as
a representative for contrasting density and symmetry. We vary the number
of nodes within these template classes and plot the corresponding execution
time in Fig. 8.3.
Note the missing data points for PBE and RPS for template size more
than seven. For larger templates, they both seem to encounter memory
management issues. In contrast, both our implementations scale without
any issues and stay fast and efficient across scales. It is noted that edge-
per-block implementation scales better than node-per-block implementation.
This observation is analyzed in depth in Section 8.2.5.
8.2.4 Scaling with Data Graph Size
To better understand how the performance scales with the size of the data
graph, we use synthetic random geometric graphs (RGGs) from the DI-
MACS10 [44] challenge made available via Network Repository [45]. A syn-
thetic RGG at scale s has exactly 2s vertices as demonstrated in Table 8.2.
We use moderately sized six-node wheel and clique as our template graphs
to compare performance at different densities of the template graphs.
Figure 8.4 shows how the performance grows with the data graph size. It
38
(a) Template class: Cliques
(b) Template class: Wheels
Figure 8.3: Performance scaling with change in template graph size
39
Table 8.2: RGG graph statistics at various scales
Scale |V | |E| Max degree Size (in MB)
15 215 = 32,768 320,480 24 2
16 216 = 65,536 684,254 27 4
17 217 = 131,072 1,457,506 28 9
18 218 = 262,144 3,094,566 31 20
19 219 = 524,288 6,539,532 30 43
20 220 = 1,048,576 13,783,240 36 92
21 221 = 2,097,152 28,975,990 37 207
22 222 = 4,194,304 60,718,396 36 448
23 223 = 8,388,608 127,002,786 40 953
24 224 = 16,777,216 265,114,400 40 2109
is observed that our implementation remains fast and efficient at all scales.
Since for these graphs maximum degrees remain more or less constant at all
scales, the change in computation can be attributed to increase in number
of search trees. The increase in the performance gap to our competitors
demonstrates the superior inter-tree parallelization in our approach which is
enabled through the use of persistent memory.
8.2.5 Parallelization Strategy Analysis
In this section, we compare and contrast the performance of the two par-
allelization strategies as explained in Section 7.3 by contrasting their per-
formance scaling characteristics with template graph size. Similar to Sec-
tion 8.2.3 we use two template classes, namely cliques and wheels, and vary
the number of nodes to study the scaling characteristics as shown in Fig. 8.5.
Across all template classes and data graphs, we notice that the edge-per-
block implementation scales better as compared to the node-per-block im-
plementation. The turn over point varies across data graphs, but is generally
around five to seven nodes. The node-per-block implementation amortizes
the cost of extracting the induced subgraph and binary encoding for an en-
tire tree. Whereas, the edge-per-block implementation extracts more paral-
40
Figure 8.4: Performance scaling with change in data graph size
lelism by parallelizing smaller sub-trees and is consequently less susceptible
to load imbalance. For smaller templates, the search trees are shallower,
leading to lower load imbalance and lesser work to compensate the cost of
induced subgraph extraction and binary encoding. Consequently, node-per-
block implementation performs better on smaller templates and gives way to
edge-per-block implementation as the template size increases.
It is also interesting to note the seemingly odd scaling characteristics of
cliques for the cit-patents graph. Unlike other templates and data graphs
where the number of matches also grows, cit-patents does not have many
large cliques. Hence, most trees end up terminating at a shallow depth
resulting in a significantly smaller increase in the computation complexity.
Figure 8.6 contrasts the clique counts for com-dblp and cit-patents to support
the claim.
41
Figure 8.5: Contrasting the two parallelization strategies
Figure 8.6: Comparing growth of clique counts
42
(a) Data Graph: com-dblp; Geometric mean speedup: 5.18×
(b) Data Graph: cit-patents; Geometric mean speedup: 5.26×
Figure 8.7: Execution times and speedups compared to state-of-the-art
GPU implementations
43
(c) Data Graph: com-lj; Geometric mean speedup: 18.01×
(d) Data Graph: com-orkut; Geometric mean speedup: 4.02×
Figure 8.7: Execution times and speedups compared to state-of-the-art
GPU implementations (Continued)
44
(e) Data Graph: com-friendster; Geometric mean speedup: 8.25×
















































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































CONCLUSION AND FUTURE WORK
In this thesis, a fast solution for GPU-based subgraph isomorphism is pro-
posed. The solution exploits a traditionally sequential algorithm and exe-
cutes it in an embarrassingly parallel fashion. It brings optimizations from
solution for a specific problem and molds it to solve the general problem. We
demonstrate the performance impact of the solution by comparing it with
the fastest state-of-the-art algorithms and being faster with a considerable
margin.
However, the performance comes at the cost of limiting the templates that
can be queried. The same solution can be extended to solve the general
problem in a few ways. One approach is to extend the central component
from node to edges or even triangles. Computation of the basic components
is extremely fast and can serve as building blocks for rest of the template. An
alternate approach is decompose a general template into respective compo-
nents each of which has a central node. These components can be computed
in parallel and results merged to get the final solution.
Multiple opportunities exist to speed up the existing solution as well. The
blocks and threads are partitioned statically in the current solution. Simi-
larly, memory allocation is also performed statically. This is primarily due
to the overheads of dynamic partitioning of threads and memory alloca-
tion. However, smart dynamic solutions can overcome these limitations to
optimally utilize the memory and compute capabilities of the GPU. Further-
more, since the approach is limited by the maximum degree of the graph,
other approaches might prove to be more effective for small-dense graphs.
49
REFERENCES
[1] D. S. Johnson and M. R. Garey, Computers and Intractability; A Guide
to the Theory of NP-Completeness. WH Freeman, 1979.
[2] J. R. Ullmann, “An algorithm for subgraph isomorphism,” Journal of
the ACM (JACM), vol. 23, no. 1, pp. 31–42, 1976.
[3] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub) graph
isomorphism algorithm for matching large graphs,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 26, no. 10, pp. 1367–
1372, 2004.
[4] V. Carletti, P. Foggia, A. Saggese, and M. Vento, “Challenging the time
complexity of exact subgraph isomorphism for huge and dense graphs
with VF3,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, vol. 40, no. 4, pp. 804–818, 2017.
[5] H.-N. Tran, J.-J. Kim, and B. He, “Fast subgraph matching on
large graphs using graphics processors,” in International Conference on
Database Systems for Advanced Applications. Springer, 2015, pp. 299–
315.
[6] L. Zeng, L. Zou, M. T. Özsu, L. Hu, and F. Zhang, “GSI: GPU-friendly
subgraph isomorphism,” in 2020 IEEE 36th International Conference
on Data Engineering (ICDE). IEEE, 2020, pp. 1249–1260.
[7] L. Wang and J. D. Owens, “Fast gunrock subgraph matching (GSM) on
GPUs,” arXiv preprint arXiv:2003.01527, 2020.
[8] W. Guo, Y. Li, and K.-L. Tan, “Exploiting reuse for GPU subgraph
enumeration,” IEEE Transactions on Knowledge and Data Engineering,
2020.
[9] NVIDIA Corporation, NVIDIA Tesla V100 GPU Architecture, The
World’s Most Advanced Data Center GPU. NVIDIA Corporation, 2017.
[10] J. R. Ullmann, “Bit-vector algorithms for binary constraint satisfac-
tion and subgraph isomorphism,” Journal of Experimental Algorithmics
(JEA), vol. 15, 2011.
50
[11] V. Carletti, P. Foggia, A. Saggese, and M. Vento, “Introducing VF3:
A new algorithm for subgraph isomorphism,” in International Work-
shop on Graph-Based Representations in Pattern Recognition. Springer,
2017, pp. 128–139.
[12] V. Carletti, P. Foggia, A. Greco, M. Vento, and V. Vigilante, “VF3-
light: A lightweight subgraph isomorphism algorithm and its experi-
mental evaluation,” Pattern Recognition Letters, vol. 125, pp. 591–596,
2019.
[13] X. Yan, P. S. Yu, and J. Han, “Graph indexing: A frequent structure-
based approach,” in Proceedings of the 2004 ACM SIGMOD Interna-
tional Conference on Management of Data, 2004, pp. 335–346.
[14] H. He and A. K. Singh, “Closure-tree: An index structure for
graph queries,” in 22nd International Conference on Data Engineering
(ICDE’06). IEEE, 2006, pp. 38–38.
[15] S. Zhang, M. Hu, and J. Yang, “TreePi: A novel graph indexing
method,” in 2007 IEEE 23rd International Conference on Data En-
gineering. IEEE, 2007, pp. 966–975.
[16] H. Shang, Y. Zhang, X. Lin, and J. X. Yu, “Taming verification hard-
ness: an efficient algorithm for testing subgraph isomorphism,” Proceed-
ings of the VLDB Endowment, vol. 1, no. 1, pp. 364–375, 2008.
[17] V. Bonnici, A. Ferro, R. Giugno, A. Pulvirenti, and D. Shasha, “Enhanc-
ing graph database indexing by suffix tree structure,” in IAPR Interna-
tional Conference on Pattern Recognition in Bioinformatics. Springer,
2010, pp. 195–203.
[18] C. Solnon, “Alldifferent-based filtering for subgraph isomorphism,” Ar-
tificial Intelligence, vol. 174, no. 12-13, pp. 850–864, 2010.
[19] J. Larrosa and G. Valiente, “Constraint satisfaction algorithms for graph
pattern matching,” Mathematical structures in computer science, vol. 12,
no. 4, p. 403, 2002.
[20] C. McCreesh and P. Prosser, “A parallel, backjumping subgraph isomor-
phism algorithm using supplemental graphs,” in International Confer-
ence on Principles and Practice of Constraint Programming. Springer,
2015, pp. 295–312.
[21] V. Carletti, P. Foggia, P. Ritrovato, M. Vento, and V. Vigilante, “A
parallel algorithm for subgraph isomorphism,” in International Work-
shop on Graph-Based Representations in Pattern Recognition. Springer,
2019, pp. 141–151.
51
[22] L. Lai, L. Qin, X. Lin, and L. Chang, “Scalable subgraph enumeration
in mapreduce: A cost-oriented approach,” The VLDB Journal, vol. 26,
no. 3, pp. 421–446, 2017.
[23] L. Lai, L. Qin, X. Lin, Y. Zhang, L. Chang, and S. Yang, “Scalable dis-
tributed subgraph enumeration,” Proceedings of the VLDB Endowment,
vol. 10, no. 3, pp. 217–228, 2016.
[24] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph
matching on billion node graphs,” arXiv preprint arXiv:1205.6691, 2012.
[25] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra, “Worst-case optimal join
algorithms,” Journal of the ACM (JACM), vol. 65, no. 3, pp. 1–40, 2018.
[26] L. Lai, Z. Qing, Z. Yang, X. Jin, Z. Lai, R. Wang, K. Hao, X. Lin, L. Qin,
W. Zhang et al., “Distributed subgraph matching on timely dataflow,”
Proceedings of the VLDB Endowment, vol. 12, no. 10, pp. 1099–1112,
2019.
[27] Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu, “Parallel subgraph
listing in a large-scale graph,” in Proceedings of the 2014 ACM SIGMOD
International Conference on Management of Data, 2014, pp. 625–636.
[28] B. Bhattarai, H. Liu, and H. H. Huang, “CECI: Compact embedding
cluster index for scalable subgraph matching,” in Proceedings of the 2019
International Conference on Management of Data, 2019, pp. 1447–1462.
[29] S. Ma, Y. Cao, J. Huai, and T. Wo, “Distributed graph pattern match-
ing,” in Proceedings of the 21st International Conference on World Wide
Web, 2012, pp. 949–958.
[30] W. Guo, Y. Li, M. Sha, B. He, X. Xiao, and K.-L. Tan, “GPU-
accelerated subgraph enumeration on partitioned graphs,” in Proceed-
ings of the 2020 ACM SIGMOD International Conference on Manage-
ment of Data, 2020, pp. 1067–1082.
[31] S. Ghosh and M. Halappanavar, “TriC: Distributed-memory triangle
counting by exploiting the graph structure,” in 2020 IEEE High Per-
formance Extreme Computing Conference (HPEC). IEEE, 2020, pp.
1–6.
[32] S. Pandey, X. S. Li, A. Buluc, J. Xu, and H. Liu, “H-index: Hash-
indexing for parallel triangle counting on GPUs,” in 2019 IEEE High
Performance Extreme Computing Conference (HPEC). IEEE, 2019,
pp. 1–7.
52
[33] C. Pearson, M. Almasri, O. Anjum, V. S. Mailthody, Z. Qureshi,
R. Nagi, J. Xiong, and W.-m. Hwu, “Update on triangle counting on
GPU,” in 2019 IEEE High Performance Extreme Computing Conference
(HPEC). IEEE, 2019, pp. 1–7.
[34] V. S. Mailthody, K. Date, Z. Qureshi, C. Pearson, R. Nagi, J. Xiong,
and W.-m. Hwu, “Collaborative (CPU + GPU) algorithms for triangle
counting and truss decomposition,” in 2018 IEEE High Performance
extreme Computing Conference (HPEC). IEEE, 2018, pp. 1–7.
[35] S. Diab, M. G. Olabi, and I. El Hajj, “KTrussExplorer: Exploring
the design space of k-truss decomposition optimizations on GPUs,” in
2020 IEEE High Performance Extreme Computing Conference (HPEC).
IEEE, 2020, pp. 1–8.
[36] M. Almasri, O. Anjum, C. Pearson, Z. Qureshi, V. S. Mailthody,
R. Nagi, J. Xiong, and W.-m. Hwu, “Update on k-truss decomposi-
tion on GPU,” in 2019 IEEE High Performance Extreme Computing
Conference (HPEC). IEEE, 2019, pp. 1–7.
[37] M. Almasri, I. E. Hajj, R. Nagi, J. Xiong, and W. mei Hwu, “Acceler-
ating k-clique counting on GPUs,” in Proceedings of the 547th Interna-
tional Conference on Very Large Data Bases (VLDB), 2021.
[38] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,
P. Monticciolo, A. Reuther, S. Smith, W. Song et al., “Static graph
challenge: Subgraph isomorphism,” in 2017 IEEE High Performance
Extreme Computing Conference (HPEC). IEEE, 2017, pp. 1–6.
[39] J. A. Grochow and M. Kellis, “Network motif discovery using subgraph
enumeration and symmetry-breaking,” in Annual International Confer-
ence on Research in Computational Molecular Biology. Springer, 2007,
pp. 92–106.
[40] B. D. McKay and A. Piperno, “Practical graph isomorphism, II,” Jour-
nal of Symbolic Computation, vol. 60, pp. 94–112, 2014.
[41] J. Jenkins, I. Arkatkar, J. D. Owens, A. Choudhary, and N. F. Sama-
tova, “Lessons learned from exploring the backtracking paradigm on the
GPU,” in European Conference on Parallel Processing. Springer, 2011,
pp. 425–437.
[42] D. Merrill, “CUB,” NVIDIA Research, 2015.
[43] J. Leskovec and A. Krevl, “SNAP datasets: Stanford large network
dataset collection,” 2014.
53
[44] D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, Graph parti-
tioning and graph clustering. American Mathematical Society Provi-
dence, RI, 2013, vol. 588.
[45] R. A. Rossi and N. K. Ahmed, “The network data repository with





Here we present an example to provide a walk-through of Algorithm 1 to get
the query sequence from a given template graph. We use the template graph
as shown in Fig. A.1 for this example. Figure A.2 provides a summary of the
algorithm state as it process the template graph.
Figure A.1: Template graph example
Figure A.2(a) demonstrates the initial state of the algorithm as depicted
in Lines 1-2. dM refers to the node-mapping degree of each node, P denotes
the likelihood for each node as is computed in Line 4-8, and ST denotes the
output query sequence.
In the first iteration, the node-mapping degree, dM of all nodes is 0 since
there are no nodes in the query sequence ST yet. Hence, the likelihood for
each node is simply the degree of the node. Consequently, the node with the
maximum likelihood in this iteration is the highest degree node as shown in
Fig. A.2(b).
At the end of first iteration, the node-mapping degree for each neighbor of
node D is updated as in Lines 11-13. Since node D is connected to every other
node, dM of all other nodes is still the same causing us to choose the node
with the next highest degree in the second iteration as shown in Fig. A.2(c).
The node-mapping degree of neighbors of A are incremented at the end of
the second iteration. While this includes D, we ignore its value since it is
already accounted in the sequence. Notice that the likelihood or priority of
55
Figure A.2: Query sequence generation example
56
nodes B and C gets bumped up compared to node E and are equal. The tie
is broken arbitrarily and node B is chosen as shown in Fig. A.2(d).
With nodes A and D already in the sequence, none of the neighbors of node
B get their node-mapping degree updated. Consequently, the likelihoods
remain unchanged except for node B. Node C is chosen as a result in the
fourth iteration as shown in Fig. A.2(e). With only one node left in the
sequence, it is added to the sequence at the end of the fifth iteration to
obtain the final query sequence as shown in Fig. A.2(f).
57
