A Block-Based Triangle Counting Algorithm on Heterogeneous Environments by Yaşar, Abdurrahman et al.
1A Block-Based Triangle Counting Algorithm on
Heterogeneous Environments
Abdurrahman Yas¸ar, Sivasankaran Rajamanickam (Member, IEEE), Jonathan Berry, and
U¨mit V. C¸atalyu¨rek (Fellow, IEEE)
Abstract—Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle
counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the
algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is
well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small
tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by
providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle
enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric,
our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the
fastest published CPU-only time.
Index Terms—triangle counting, task-based, block-based, sub-graph, multi-core, multi-GPU.
F
1 INTRODUCTION
Graphs are very useful data-structures. They can represent
different applications, such as social network analytics, bi-
ological networks, and scientific simulations. Today there
exist large graphs that have billions of vertices and edges.
High-performance processing of these large graphs is cru-
cial and pervasive. Most of the current high-performance
solutions rely on distributed systems because one can pro-
cess problem instances that are bigger than the main mem-
ory of a single node. However, communication and data
replication in distributed systems often leads to bottlenecks.
The popularity of multi-core machines increased drasti-
cally during the past decade. Today, different hardware
vendors have developed processors with multiple cores
to deliver improved performance, and hence the multi-
core technology became ubiquitous. In addition to multi-
core technology, many hardware accelerators like graphics
processing units (GPUs), and field-programmable gate ar-
rays (FPGAs) also became a part of the systems to serve
different parallelization needs, and more are coming. Soon,
we will have distributed systems consisting of multi-core
servers with many different types of hardware accelerators.
Such an environment increases the importance of design-
ing flexible algorithms for performance-critical kernels and
implementations that can run well on various platforms.
In this work, we propose an architecture-agnostic triangle
counting algorithm, BBTC, and we do an extensive analysis
• Yas¸ar, and C¸atalyu¨rek are with the School of Computational Science and
Engineering at Georgia Institute of Technology.
E-mail: {ayasar,umit}@gatech.edu
• Rajamanickam, and Berry are with the Center for Computing Research at
Sandia National Laboratories, Albuquerque, NM.
E-mail: {srajama,jberry}@sandia.gov
Manuscript received xx xx, xx; revised xx xx, xx.
of its performance on a single node multi-CPU and multi-
GPU execution environment.
u
v
w
(a) A triangle
G0
G2
G4
u x x
xv
v w
(b) A 5-way 1D partition
G0,2
G2,4
G0,4
u x x
xv
v w
(c) A 5x5-way 2D partition
Fig. 1. Finding graph triangles in 1D and 2D-partitioned adjacency
matrices.
The triangle counting problem [1]–[9] is to find the
number of 3-cliques in an undirected graph (Fig. 1(a)). This
is a crucial graph kernel that serves as a building block
for many other graph problems such as k-truss decom-
position [10], community detection [11], thematic structure
identification [12], dense sub-graph discovery [13] and link
recommendation [14]. In recent years, algorithmic and ar-
chitectural advances have led to great improvements in
triangle counting performance [3], [15]–[17]. None of these
approaches are truly heterogeneous because they all target
one architectural feature. In Section 4, we propose a block-
ar
X
iv
:2
00
9.
12
45
7v
1 
 [c
s.D
S]
  2
5 S
ep
 20
20
2based approach for heterogeneous execution environments
that simultaneously leverages both CPUs and multiple
GPUs. Such an approach is crucial for achieving three
goals: maximizing memory utilization, processing graphs
that cannot fit into GPU memory and overlapping the data-
copy time with the computation. To achieve these goals,
we tackle the triangle counting problem under three axes;
data/computation partitioning, a block-based formulation
for the triangle counting problem, and architecture agnostic
task execution.
At the abstract level, our execution model is as follows.
Computation is divided into tasks. Each task depends on
multiple blocks. A light-weight scheduler schedules tasks
on available devices. Scheduler is also responsible for the
movement of data blocks that are needed for execution of
each individual task.
Execution time is a function of the computational load
of individual tasks, and communication costs. The granu-
larity of the task decomposition is crucial for performance.
Therefore, partitioning is the first step and refers to both
data and computation partitioning. In the context of graphs,
we can do the partitioning at the vertex level (coarse-
grained) or the edge level (fine-grained). A hybrid approach
may also be used. There is a duality between graphs and
sparse matrices. Vertex-level and edge-level partitionings
correspond to row (1D) and nonzero (2D) partitionings. For
visualization purposes, consider the adjacency matrix rep-
resentation of a graph. Fig. 1(b) illustrates a 1D partitioning.
This generates coarse-grained tasks. Some state-of-the-art
CPU-based triangle counting algorithms (such as [2], [18])
generate that kind of coarse-grained tasks. However, such a
task generation ends up with higher copy/communication
costs on a heterogeneous environment and highly imbal-
anced workload distribution [19]–[23]. We can address this
problem by using a 2D partitioning strategy and generating
medium-grained tasks (e.g., see Fig. 1(c)) to solve the tri-
angle counting problem. Generally, there are two classes of
algorithms that provide such partitioning: i) connectivity-
based techniques (graph and hypergraph) [21], ii) spatial
partitioning techniques [20]. Connectivity-based techniques
find a reordering of the vertices while spatial techniques
break nonzeros of the matrix into blocks without reordering
the vertices.
We seek partitioning methods that maximize memory ef-
ficiency and minimize data movement which is essential to
overlap the communication with the computation. Unfortu-
nately, unstructured or random partitionings may easily end
up with highly irregular data access patterns (i.e., one block
may require to access all other blocks). Our ultimate goal
is to localize sub-graph computations, by improving data
access patterns and data movement in the system architec-
ture. Therefore we need a special partitioning scheme that
we could use for divide-and-conquer approaches. For this
reason, we utilize a special 2D partitioning scheme called
symmetric rectilinear partitioning (also known as symmetric
generalized block distribution (see Fig. 1(b)). This problem,
first defined by Grigni and Manne [24]. Recently, Yas¸ar and
C¸atalyu¨rek [25] proposed some heuristic algorithms to solve
this problem. It is also possible to achieve such partitioning
by utilizing more complex connectivity-based [21] tech-
niques. However, these techniques have two drawbacks;
first, most of these techniques are computationally expen-
sive, second, connectivity-based ordering of the vertices
may affect the computational complexity of the algorithm.
In sparse graphs with heavy-tailed degree distributions,
triangle counting has low computational complexity [5]
therefore we need a fast partitioning technique. In this work,
we use a parallel version of the PBD algorithm presented
in [25] to partition the graph into blocks.
We propose a medium-grained triangle counting formu-
lation (BBTC) on top of symmetric rectilinear partitioning.
This makes the algorithm naturally suitable for task-based
execution on shared and distributed-memory systems as
well as on heterogeneous architectures. BBTC defines tasks
based on generated blocks. CPUs and GPUs process tasks
together. Although the creation of tasks is dependent on
graph data, our algorithm to dispatch and execute these
tasks is data-agnostic. This agnostic structure will allow
execution on heterogeneous settings with any accelerator
and CPU combination.
One can further optimize the execution time by uti-
lizing different data structures and implementation of the
graph kernel based on properties of the input data and
architectures. However, in this work, we utilize a single
data-structure and just one implementation of each task
type on each processor type. The power of the algorithm
comes from partitioning of the triangle space into tasks, a
dynamic task-scheduling scheme that schedules tasks on
multi-core-CPUs, use of multiple streams on multiple GPUs
to effectively utilize the massive computing capabilities on
the GPUs, and, at the same time, move the data to the GPUs
asynchronously.
The primary contributions of this work can be listed as:
• We show how to partition a graph’s triangle set, and
we use this observation to derive a new block-based
triangle counting algorithm.
• We show how to run our algorithm on heterogeneous
hardware when the graph does not fit on device.
• We propose dynamic scheduling techniques for
block-based execution model for heterogeneous en-
vironments and show how tasks can be distributed
among different CPUs and GPUs efficiently.
• We evaluate characteristics of our algorithm in the-
oretical and practical aspects and perform extensive
analysis on different architectures.
The preliminary results of this work were presented as
an extended abstract in [16]. This paper presents a detailed
description of our algorithm and its complexity analysis,
a thorough experimental evaluation of the components of
the algorithm and comparisons against the state-of-the-art.
These results demonstrate that our implementation beats
the fastest published end-to-end CPU execution times. The
performance improvement is up to 11× over our previous
state-of-the-art implementation [18], 16× over another state-
of-the-art implementation [2], and 1.5× over state-of-the-art
multi-GPU implementation [3] on the largest three graphs.
2 PROBLEM FORMULATION
An undirected graph G = (V,E), consists of a set of
vertices V and a set of edges E. An edge e is referred
3to as e = (u, v) ∈ E, where u, v ∈ V . In the context of
triangle counting, efficient algorithms traverse vertices and
edges following a complete ordering, which avoids double
counting and also reduces the number of edges traversed.
For the sake of simplicity, in this work, we will use vertex
IDs for that complete ordering in the traversal. Since G is
undirected, we represent each edge as (u, v), where u < v.
Under this representation, we call u the source vertex and
v the destination vertex. Thus, we induce an ordering that
is helpful when explaining our partitioning. The neighbor
list of a vertex u ∈ V is also defined using the same
complete ordering, i.e., N(G, u) = {v ∈ V : (u, v) ∈ E
and u < v}. The degree of a vertex u ∈ V is defined as
d(G, u) = |N(G, u)|. We will use n and m for number of
vertices and edges, respectively, i.e., n = |V | and m = |E|.
Definition 1. Triangle Counting Problem. Given an undi-
rected graph G = (V,E), the triangle counting problem
computes the number of unique three cliques, i.e., the num-
ber of mutually connected three vertices, in the graph, G.
Given a graph G, and an integer p such that 1 ≤
p ≤ n. Let V p be a vertex partition set of V into p
non-overlapping partitions such that; V p = {V0, . . . , Vp−1}
where ∪0≤i≤p−1Vi = V and ∩0≤i≤p−1Vi = ∅.
Let Gi,j be a sub-graph of G such that Gi,j = {(u, v) ∈
E : u ∈ Vi ∧ v ∈ Vj}. By definition G = ∪0≤i,j≤p−1Gi,j .
Our Gi,j construct allows us to avoid double-counting any
triangles. In the partitioned graph we will only use sub-
graphs (Gi,j) that appear in upper triangular part of the
adjacency matrix, (i ≤ j) which allow us to only traverse
edges (u, v) ∈ Gi,j , where u < v by definition. Source and
destination vertex sets in a given sub-graphGi,j , are defined
as Vs(Gi,j) = Vi and Vd(Gi,j) = Vj , respectively. The partial
neighbor list of a vertex u ∈ Gi,j is defined as N(Gi,j , u).
The partial degree of a vertex u ∈ Vs(Gi,j) is defined as
d(Gi,j , u) = |N(Gi,j , u)|. All the notations are summarized
in Table 1.
TABLE 1
Notation used in this paper.
Symbol Description
G = (V,E) A graph G with vertex
and edge sets, V and E, respectively
n = |V | Number of vertices
m = |E| Number of edges
V p Vertex partition set; V p = {V1, . . . , Vp}
Vs(Gi,j) = Vi Source vertices of the subgraph Gi,j
Vd(Gi,j) = Vj Destination vertices of the subgraph Gi,j
N(G, u) Neighbor list of vertex u ∈ V
N(Gi,j , u) Partial neighbor list of vertex u, i.e.,
N(Gi,j , u) = {(u, v) : u ∈ Vi ∧ v ∈ Vj}
d(G, u) Degree of vertex u, d(G, u) = |N(G, u)|
d(Gi,j , u) Partial degree of vertex u ∈ Vs(Gi,j),
d(Gi,j , u) = |N(Gi,j , u)|
t A task; t = {i, j, k} = {Gi,j , Gj,k, Gi,k}
where i ≤ j ≤ k
T = {t0, . . . , ti} Set of all tasks
Algorithm 1: NINTERSECT (A,B)
. Computes intersection count. A and B are sorted.
a← 0; b← 0; c← 0;
while a < |A| and b < |B| do
if A[b] = B[b] then
++c; ++a; ++b;
else if A[a] < B[b] then
++a
else
++b
return c
Algorithm 2: TC-INTERSECT (G)
τ ← 0 . Initialize number of triangles to 0
for each u ∈ V do
for each v ∈ N(G, u) do
τ ← τ+ NINTERSECT (N(G, u), N(G, v))
return τ
3 WORK OPTIMAL TRIANGLE COUNTING ALGO-
RITHMS
Sequential triangle counting is a well-studied problem [1],
[5]. The best algorithms rely on matrix multiplication and
run in O(nω) or O(mω/(1+ω)) where ω is the matrix mul-
tiplication exponent [6], [8]. However, these algorithms are
impractical and require θ(n2) space. The fastest algorithm
that doesn’t require θ(n2) space does O(m3/2) work [1].
Merge-based and hashmap-based versions can both result in
work optimal algorithms. Both algorithms process vertices
based on their degrees to bound the maximum degree in
the graph and guarantee O(m3/2) work. This bound can be
improved if the degree distribution is governed by a power
law with exponent α. In such a case, Latapy [1] shows that
triangle enumeration is bounded by Θ(mn
1
α ). This is an
asymptotic improvement over the worst-case. Berry, et al. [5]
improved upon this further and show that complexity of
this problem can actually be Θ(n) in realistic circumstances
where the 4/3 moment of a graph is bounded by a constant.
Alg. 1 outputs the number of the common elements
in given two sorted lists using the set intersection. This
algorithm can be used in triangle counting if the neigh-
borhood of each vertex is sorted based on the vertex IDs.
While this algorithm is cache-friendly, it results in poor
performance when there are high degree vertices. Alg. 2
outputs the number of triangles in the graph using this
list-based intersection. Alg. 3 uses a hash map to count the
number of common neighbors in two vertices’ adjacency
list.
Variations of Alg. 2 and Alg. 3 are widely used in
many sequential and parallel works [1], [2], [18]. Latapy [1]
proposed a refinement technique inspired from ayz-listing
that doesn’t require θ(n2) space. This refinement technique
applies a hash map based algorithm for high-degree vertices
and a list-based algorithm for low-degree vertices. In this
paper, we use Latapy’s algorithm as our sequential baseline
since it provides the fastest sequential execution.
4Algorithm 3: TC-HMAP (G)
τ ← 0 . Initialize number of triangles to 0
for each u ∈ V do
for each v ∈ N(G, u) do
H[v]← u
for each v ∈ N(G, u) do
for each w ∈ N(G, v) do
if H[w] = u then
τ ← τ + 1
return τ
0
3
2
1
4
5
8
9
6
7
(a) Toy Graph
0
1
9
3
6
7
8
2
4
5
(b) Degree Based Ordering
0
1
2
4
3
5
7
6
9
8
0 1 2 43 5 76 98
G0,0 G0,1 G0,2
G1,0 G1,1 G1,2
G2,0 G2,1 G2,2
(c) Matrix Form
G0,0 G0,1 G0,2 G1,0 G1,1 G1,2 G2,0 G2,1 G2,2
3 6 5 7 5 7 9 9 8 9 8 7 5 6 7 7
5 10 15 18 21 24 26 28 28
Dst. Vertices
Subgraphs
Pointers
1 1 1 1 1 1 2 3 4 6 7 8 9 10 11 11 11 11 11 12 12 13 15 16 16 16 16 16
0 1 2 3 4 5 76 108 9 11 12 13 14 15 16 17 1918 2220 21 23 24 2625 27
Dst. Pointers 16 16
27 28
(d) Block CSR Layout
Fig. 2. Toy Example. 2(a): a toy graph. 2(b): Degree-based ordering
of 2(a), each edge in a block colored with same color and each
vertex in a vertex partition colored with the same color. 2(c): Adja-
cency matrix representation of the symmetric rectilinear partitioned
graph. 2(d): Block CSR representation of the given partitioned graph.
Tasks, T = {{0, 0, 0}, {0, 0, 1}, {0, 0, 2}, {0, 1, 1}, {0, 1, 2}, {0, 2, 2},
{1, 1, 1}{1, 1, 2}, {1, 2, 2}, {2, 2, 2}}.
4 BLOCK-BASED TRIANGLE COUNTING (BBTC)
Most of the current state-of-the-art on the triangle counting
problem require access to the entire adjacency list for each
vertex. In this paper, we investigate the usage of partial
adjacency lists (i.e., subgraphs).
4.1 Subgraph generation
A contiguous rectangle in an adjacency matrix corresponds
to an edge-induced subgraph (a subset of edges in the graph).
We propose to use symmetric rectilinear partitioning (also
known as symmetric generalized block distribution [24])
to generate these subgraphs. This partitioning is similar to
2D matrix distributions that have been widely used in
dense algebra with cartesian distributions [26] (i.e., the same
partitioning vector is used for row and column partition)
and also applied to sparse matrices [27]. In this partitioning,
diagonal blocks are required to be squares.
As stated earlier, in order to avoid double counting and
reduce the amount of edge traversal, we use a total ordering
of vertex IDs and also only store upper triangular part
of the adjacency matrix. Furthermore, in order to reduce
the number of edges traversed even further, similar to
existing efficient sequential algorithms [1], [5], [7], [9], we
re-order vertices in non-decreasing degree order, then apply
symmetric partitioning to the adjacency matrix after these
transformations. Figure 2 illustrates this process on a toy
graph.
Again, by definition, edges of a triangle can appear in
at most three of the subgraphs defined by our block-based
partitioning. Furthermore, in a triangle, for a chosen pair of
edges, there exists only one sub-graph such that the third
edge belongs. Therefore we can bound the computational
complexity of triangle counting and the data movement cost
using symmetric rectilinear partitioning.
Our algorithm can work with any valid symmetric par-
titioning. The only constraint is the limited memory of the
computing devices that will be used. Therefore, ideal sym-
metric partitioning should limit the sizes of subgraphs, such
that three subgraphs can fit into memory of the computing
devices. In this work, we used a lightweight symmetric
rectilinear partitioning algorithm called PBD [25]. After
the generation of the blocks, we use a block variant of
CSR (BCSR) [28] to store the graph. This storage format is
illustrated in Fig. 2(d). In this layout, blocks are ordered in
row-major order.
4.2 Composition of the task list
Assume that vertices u, v and w form a triangle in a given
graph (see Fig. 1). Let the first edge, (u, v), appear in the
subgraph Gi,j . It means that u ∈ Vi and v ∈ Vj . Hence,
the second edge, (v, w), may appear in any subgraph Gj,k
where j ≤ k ≤ p − 1 so w ∈ Vk. Therefore, by definition
the third edge, (u,w) appears in Gi,k because u ∈ Vi and
w ∈ Vk. In the context of this paper a task can be defined as
below.
Definition 2. Task. Given a symmetric rectilinear parti-
tioned graph, G = (V,E) that is partitioned into p × p
blocks, and a triplet {i, j, k} such that i ≤ j ≤ k, a task,
t, is defined as the unit of work to count the number of
triangles in {Gi,j , Gj,k, Gi,k}.
To illustrate, in Fig. 2(b), vertices 3, 7 and 9 form a
triangle in the toy graph. The first edge, (3, 7), appears
in the subgraph G0,1, the second edge, (7, 9), appears in
the subgraph G1,2 and the third edge, (3, 9) appears in the
subgraph G0,2. This would be counted when G0,1, G1,2 and
G0,2 form a task; t = {0, 1, 2} = {G0,1, G1,2, G0,2}.
One can count number of triangles, (u, v, w), in given
partitioned graph by considering all possible triples of
blocks (i.e., tasks); {Gi,j , Gj,k, Gi,k} where (u, v) ∈ Gi,j ,
(v, w) ∈ Gj,k and (u,w) ∈ Gi,k. However in an undirected
graph this approach counts each triangle six times. Most
state-of-the-art work treats undirected graphs as directed by
considering edges from smaller vertex IDs to larger vertex
IDs (or vice versa) to be able to count each triangle once.
In the block-based context, this rule can be applied by
considering subgraph indices in increasing order, such that
i ≤ j ≤ k. With this restriction, if the vertex set, V , of a
given graph partitioned into p parts, then number of tasks,
|T |, is defined as:
|T | = p× (p+ 1)× (p+ 2)
6
5Algorithm 4: BBTASKCOMPOSITION (G, p)
T ← ∅ . Initialize empty task list
for i = 0 to p− 1 do
for j = i to p− 1 do
for k = j to p− 1 do
T ← T ∪ {Gi,j , Gj,k, Gi,k}
return T
For instance, as shown in the toy example presented in
Fig. 2 10, tasks can be defined. Alg. 4 illustrates task list
composition for a given symmetric rectilinear partitioned
graph.
4.3 Counting triangles in a task
Algorithm 5: BBTC-INTERSECT (Gi,j , Gj,k, Gi,k)
τ ← 0 . Initialize number of triangles to 0
for each u ∈ VS(Gi,j) do
for each v ∈ N(Gi,j , u) do
τ ← τ+ NINTERSECT (N(Gi,k, u), N(Gj,k, v))
return τ
In the context of this paper a task is defined as counting
the number of triangles that appear in three subgraphs;Gi,j ,
Gj,k and Gi,k where i ≤ j ≤ k. For each edge, (u, v) in
Gi,j , this operation can be done by counting the number
of common neighbors between the partial neighbor list of
u in Gi,k (N(Gi,k, u)) and the partial neighbor list of v
in Gj,k (N(Gj,k, v)). Similar to regular triangle counting
algorithms this can be computed using list or hashmap-
based intersection algorithms as described in Alg. 5 and
Alg. 6 respectively.
Algorithm 6: BBTC-HMAP (Gi,j , Gj,k, Gi,k)
τ ← 0 . Initialize number of triangles to 0
for each u ∈ VS(Gi,j) do
for each v ∈ N(Gi,k, u) do
H[v]← u
for each v ∈ N(Gi,j , u) do
for each w ∈ N(Gj,k, v) do
if H[w] = u then
τ ← τ + 1
return τ
4.4 Computational Complexity
In this subsection, we will present the complexity analysis
of BBTC. Let the input graph G = (V,E) be partitioned
with a p×p symmetric rectilinear partitioning. Let mmax and
mavg be, respectively, the maximum and average number of
edges (nonzeros) within subgraphs (blocks). We also define
load imbalance, λ, as λ = mmaxmavg . Let dmax be the maximum
degree in the input graph, i.e.,
dmax = max
u∈V
d(G, u),
and d′max be the maximum degree within all blocks, i.e.,
d′max = max
u∈Vs(Gi,j).∀0≤i,j<p
d(Gi,j , u).
(Please note that, they can be degrees of different vertices).
We define c as the ratio of these max degrees, i.e., c = dmaxd′max .
Clearly 1 ≤ c ≤ p.
One may see Alg. 5 and Alg. 6 as iterating over the
edges in Gi,j and counting common neighbors of source
and destination vertices’ partial neighbor lists that appear
in Gi,k and Gj,k, respectively. Hence, since there can be
at most mmax edges in Gi,j and maximum degree of a
vertex in a block is defined with d′max, a single task can
be solved in O(mmaxd′max) time. The maximum number of
nonzeros (mmax) among the blocks can be defined using
load imbalance as, mmax = λmavg. There are
p(p+1)
2 blocks,
hence the average number of nonzeros (mavg) per block is
mavg =
2m
p(p+1) . When we replace mavg with that representa-
tion, mmax equals to λ 2mp(p+1) . A task can be solved in
O
(
λ
2m
p(p+ 1)
dmax
c
)
time. Since there can be at most
p(p+ 1)(p+ 2)
6
concurrent tasks, our block-based triangle counting problem
can be solved in
O
(
p(p+ 1)(p+ 2)
6
λ
2m
p(p+ 1)
dmax
c
)
which is equal to
O
(
λp
c
mdmax
)
when we apply proper simplifications. When the vertices
are ordered based on their degrees, the intersection opera-
tion between neighbor lists of two vertices can be computed
in O(
√
m) [1] instead of O(dmax). Hence, block-based trian-
gle counting, using the degree ordering of vertices, can be
solved in
O
(
λp
c
m3/2
)
.
Note that, for dense graphs, achieving the perfect load-
imbalance (λ ≈ 1) is straightforward, and with high prob-
ability, c and p would be nearly equal. Hence, in dense
instances, the computational complexity of the block-based
formulation is the same as the work-optimal triangle count-
ing algorithms. In irregular problem instances, the load
imbalance is crucial in the computational complexity. There-
fore, balanced partitioning of the graph is essential. When
the partitioned adjacency matrix has the perfect load imbal-
ance, the computational complexity of the irregular problem
instances becomes nearly equal to the work-optimal cases.
5 PARALLEL AND HETEROGENEOUS EXECUTION
The BBTC algorithm can be parallelized by concurrently
executing the tasks. The final number of triangles is the
sum of the triangles found on all the tasks. In this work,
we utilize CPUs and GPUs to process all the tasks. At
a high level, there are no architecture-specific changes in
the algorithm. That allows the execution of our algorithm
on any accelerator and CPU combination. However, task
handling differs between CPU and GPU because of architec-
tural differences. While a CPU thread executes a single task,
6threads execute cooperatively to complete a task in parallel
on GPUs. Alg. 9 illustrates this difference.
In hybrid execution environments, assigning computa-
tionally heavy tasks to GPUs and lighter tasks to CPUs
(e.g., [29]) is one of the essential optimizations to leverage
the massive parallelism available on GPUs. We implement
a similar approach. Tasks are ordered based on their esti-
mations in the task queue (Q). GPUs start with heavy tasks
from one end of the execution queue, and CPUs process
light tasks from the other end of the execution queue. They
continue to move towards each other as they complete
the assigned tasks. BBTC keeps a global pointer, (qc), for
the last task id that a CPU thread starts to execute. Each
CPU thread atomically decrements the pointer until finding
an unprocessed task or terminates if the counter reaches
the cut-off point of the task queue. That continues while
the execution queue is not empty. Fig. 3 illustrates this
procedure. The execution of tasks by GPUs and CPUs are
described in Alg. 7 and Alg. 8, respectively.
Task
 0
Task
1
Task
 2
Task
 3
D0/S0 D1/S0 D0/S1 D1/S1 D0/S0 D1/S0 D0/S1 D1/S1 D0/S0 D1/S0 
HX HX HX HX HX HX
GPU(s)
CPU(s)
Task
 4
Task
 5
Task
 6
Task
 7
Task
 8
Task
 9
cut-off
Fig. 3. Task Scheduling Between Devices and Host. Di: ith Device, Si:
ith Stream on a Device, Hx: a CPU
For sorting, we estimate the workload of a task, t, using:
ExecT ime(t) = |{(u, v) ∈ Gi,j}| ×max{δ(Gi,k), δ(Gj,k}
where δ(Gi,j) represents the average degree of a subgraph
Gi,j i.e.,
δ(Gi,j) =
∑
∀u∈Vi d(Gi,j , u)
|Vi| .
We sort tasks in the non-increasing order based on their
estimation weights. To decrease the initial synchronization
cost of GPUs and guarantee the assignment of heavier tasks
to the GPUs, a cut-off (red line in Figure 3) is pre-defined.
CPUs do not go past the cut-off.
We use CUDA streams to execute several tasks on
the GPUs simultaneously. We create four CUDA streams,
(ns = 4), for one of the ng GPUs. Then, one of the nc CPU
threads is assigned to a stream (see Alg. 9). That CPU thread
is responsible for waiting on that stream, synchronizing the
stream, sending a task to a device through that stream, and
gathering information from a device through that stream.
All of these operations use asynchronous function calls.
When we create streams and assign a thread for each of
them, GPUs and CPUs compete for tasks and get a new one
from the queue when they finish executing a task.
The cut-off allows all GPU streams to run tasks until
the cut-off without worrying about the next assignment (the
first while loop in Alg. 7). GPUs can go past the cut-off line
if more tasks are available (the second while loop in Alg. 7).
We allow GPUs and CPUs to compete for the tasks.
However, GPUs do not compete among themselves. We
assign tasks to GPUs in a deterministic round-robin fashion.
This approach allows stream threads to know the next task
Algorithm 7: RUNTASKONGPU (tId)
gId← tId mod ns . GPU id
sId← tId/ns . Stream id
while tId < cut-off do
for each Gi,j ∈ T [tId] do
if not ISCOPIED (Gi,j , gId) then
ASYNCCOPY (Gi,j , gId)
Gi,j , Gj,k, Gi,k ← T [tId] . Get copied blocks
if ISDENSE (T [tId]) then
async BBTC-HMAP (Gi,j , Gj,k, Gi,k)
else
async BBTC-INTERSECT (Gi,j , Gj,k, Gi,k)
tId← tId+ ng × ns
. Copy blocks for next task, tId.
for each Gi,j ∈ T [tId] do
if not ISCOPIED (Gi,j , gId) then
ASYNCCOPY (Gi,j , gId)
. Get a task from Q if available, stop otherwise
while tId < |T | and not ATOMICSET (Q, tId) do
SYNCSTREAM (sId, gId) . Synchronize the
stream
Gi,j , Gj,k, Gi,k ← T [tId] . Get copied blocks
if ISDENSE (T [tId]) then
async BBTC-HMAP (Gi,j , Gj,k, Gi,k)
else
async BBTC-INTERSECT (Gi,j , Gj,k, Gi,k)
tId← tId+ ng × ns
for each Gi,j ∈ T [tId] do
if not ISCOPIED (Gi,j , gId) then
ASYNCCOPY (Gi,j , gId)
Algorithm 8: RUNTASKONCPU ()
tId← ATOMICDECREMENT (qc) . Get task index.
τ ← 0
while t ≥ cut-off do
. Check if task has been processed and set if not.
if not ATOMICSET (Q, tId) then
Gi,j , Gj,k, Gi,k ← T [tId]
if ISDENSE (T [tId]) then
τ ← τ+ BBTC-HMAP (Gi,j , Gj,k, Gi,k)
else
τ ← τ+ BBTC-INTERSECT
(Gi,j , Gj,k, Gi,k)
else
tId← ATOMICDECREMENT (qc)
return τ
that can be executed by that stream. Then a stream thread
can overlap copying the blocks of the next assignment with
the computation.
6 RELATED WORK
Inspired by AYZ listing [8], Latapy [1] proposed a refine-
ment technique that does not require θ(n2) space. Latapy’s
algorithm uses the list-based intersection for small degree
vertices and the hashmap-based intersection for high degree
vertices. In this work, we use a similar approach; i.e., any
7Algorithm 9: BBTC (G, p)
. Initialization of the system variables.
nc ← NCORES(); ng ← NGPUS(); ns ← 4;
τ ← 0 . Number of triangles
. Partitioning
G(V p, E)← PBD(G, p) . Parallel PBD algorithm
. Composing task list
T ← BBTASKCOMPOSITION (G, p) . Composing
task list
T ← ORDERTASKS (T ) . Heuristic based ordering of
tasks
. Initializing a list of bit flags for each task
Q[i] = 0, for 0 ≤ i ≤ |T | − 1
qc ← |T | . Queue index for competing cores.
. Spawning gpu threads: one per each stream
for i = 0 to ns − 1 do
for j = 0 to ng − 1 do
spawn RUNTASKONGPU (i× s+ j)
. Spawning cpu threads
for i = ns × ng to nc do
spawn RUNTASKONCPU ()
. Gathering & aggregating counts
for i = 0 to ns − 1 do
for j = 0 to ng − 1 do
t← GETCOUNT(i, j)
ATOMICADD (τ, t)
return τ
task with sparse blocks uses the list-based intersection, and
any task consist of dense blocks uses a hashmap. Shun
and Tangwongsan [2] parallelized Latapy’s compact-forward
algorithm. Recently, Dhulipala et al. [4] re-implemented an
optimized version of this algorithm.
A linear-algebra-based triangle counting implementa-
tion, kkTri (previously designated TCKK), was proposed
by Wolf et al. [30]. That work focused on efficient shared
memory parallelism on top of a portable SpGEMM (called
KK-MEM) [31] in the Kokkos Kernels library1. This work op-
timized two linear-algebra-based formulations of the trian-
gle counting problem and used different types of hashmap
accumulators based on the sparsity of the graphs. In a later
work, Yas¸ar et al. [18] improved this algorithm by using a 1D
task-based approach and an adaptive runtime. In this paper,
we use 2D partitioning instead of 1D partitioning, and we
always use dense hashmaps. The memory requirement for
dense hashmaps decreases significantly because of the 2D
partitioning. The smaller dense hashmaps are also more
suitable for GPU architectures.
Green et al. [17] proposed a list-intersection-based algo-
rithm on a single GPU. In that work, neighbor lists of source
and destination vertices of each edge are concurrently pro-
cessed by 32 threads from a warp. However, this approach
comes with expensive partitioning overhead. Hu et al. [3] try
to overcome this problem by introducing a binary search-
based intersection method. Both works copy the graph into
1. https://github.com/kokkos/kokkos-kernels
TABLE 2
Overview of the Architectures. Pinned: transfers through pinned
memory. Pageable: transfers through pageable memory. Peer to peer:
transfers done between devices. H2D: host to device. D2H: device to
host. (measured bandwidth)
DGX Newell Haswell
CPU Intel, E5-2698 POWER9 Intel, E7-4850
Cores 2× 20 2× 16 4× 14
Host Memory 512 GB 320 GB 2 TB
L2-Cache 256 KB 512 KB 256 KB
L3-Cache 50 MB 10 MB 35 MB
GPU V100 V100 N/A
GPU Memory 32 GB 32 GB N/A
Number of GPUs 8 2 N/A
Pageable H2D 9.2 GB/s 12.2 GB/s N/AD2H 8.0 GB/s 14.2 GB/s N/A
Pinned H2D 10.7 GB/s 60.0 GB/s N/AD2H 12.1 GB/s 60.0 GB/s N/A
Peer to Peer 23.5 GB/s 31.3 GB/s N/A
GPU memory before starting the computation, which causes
significant efficiency issues because of the idle time. Besides,
using that approach [17], one cannot process graphs that are
larger than the GPU’s memory size. [3] addresses this prob-
lem by applying a rectilinear partitioning strategy; however,
their execution time is drastically affected by the number of
partitions. Date et al. [15] proposed to use CPUs and GPUs
in a collaborative fashion using GPU zero-copy memory,
aiming to decrease CPU-GPU data transfer overhead by
leveraging unified memory capabilities. Inefficient use of
GPU memory causes drastic performance decrease in that
case.
Recently, Tom et al. [32] and Acer et al. [33] proposed
distributed-memory triangle counting algorithms. Tom et
al. [32] propose a 2D distributed-memory triangle counting
algorithm that follows similar steps to Cannon’s parallel
matrix multiplication algorithm. Acer et al. [33] propose an
MPI based distributed memory parallel 2D triangle count-
ing algorithm which tries to exploit the benefits of shared
memory parallelism using Cilk.
7 EXPERIMENTAL EVALUATION
We present several experiments to identify the performance
trade-offs of the proposed work. In our tests, we used three
architectures with multi-core processors and multi-GPUs.
Table 2 lists the properties of those architectures. Depending
on the architecture GNU compiler (g++) version 7.2 or Intel
compiler version 19.03 (icpc), CUDA runtime version 10.0
and OpenMP version 4.0 are used to compile and run the
code. The source code of BSD-licensed BBTC is available at
http://tda.gatech.edu/software/bbtc/.
7.1 Used state-of-the-art works for comparison
In our experiments, we compared BBTC’s performance with
three state-of-the-art implementations: (1) TCM 2, a multi-
core triangle counting algorithm proposed by Shun and
2. TCM: https://github.com/Ldhulipala/gbbs
8TABLE 3
Overview of Design Choices. † Complexity for one dimensional parallel
case. Complexity of partitioned version is not provided in [3].
State of The Art
TCM kkTri TriCore BBTC
Algorithmic properties
List Intersect. 3 7 7 3
Map Intersect. 7 3 7 3
Search Intersect. 7 7 3 7
Multi-Core 3 3 7 3
Multi-GPU 7 7 3 3
Complexity O(m
3
2 ) O(m
3
2 ) O(m
3
2 log
√
m)† O(λpc m
3
2 )
Implemented optimizations
Vertex Ordering 3 3 3 3
Compression 3 3 7 7
CUDAStreams N/A N/A 7 3
Parallelization strategy
One-dimension 3 3 3 7
Two-dimension 7 7 3 3
Used runtimes for implementation
Pthread based 3 7 7 7
OpenMP 3 3 7 3
Cilk 3 3 7 3
TBB 7 7 7 3
CUDA 7 7 3 3
Tangwongsan [2] and optimized by Dhulipala et al. [4]. We
use Dhulipala et al. [4]’s optimized version. (2) kkTri 3, a
multi-core linear algebra-based triangle counting algorithm
proposed by Wolf et al. [30] and improved by Yas¸ar et
al. [18]. We use the optimized version on Intel architec-
tures. (3) TriCore 4, a multi-GPU triangle counting algorithm
implemented by Hu et al. [3] and improved in [34]. We
use TriCore’s optimized (latest) version [34]. While TCM
and kkTri algorithms have the optimal complexity, TriCore
and BBTC algorithms have higher complexities due to their
design choices. Note that, the complexity of TriCore is for
the one-dimensional parallel case. TriCore also implements
a variant of rectilinear partitioning [24] to be able to process
graphs that cannot fit into GPU memory. This partitioning
increases the cost of their algorithm. However, the complex-
ity of TriCore’s partitioned version is not provided in [3].
Table 3 presents a high-level overview of the design choices
of these algorithms.
7.2 Dataset and peak rates
We evaluated the BBTC algorithm on 16 real-world, and 8
synthetic graphs (RMAT) from the SuiteSparse Matrix Col-
lection [35], SNAP 5, web data commons 6, and GraphChal-
3. kkTri: https://github.com/Kokkos/kokkos-kernels
4. TriCore: https://github.com/huyang1988/TC
5. SNAP Datasets: http://snap.stanford.edu/data
6. WDC Dataset: http://webdatacommons.org/hyperlinkgraph/
2014-04
lange 7. Table 4 presents, the properties of these graphs:
The number of vertices (|V |), the number of edges (|E|),
the number of triangles, the compression ratio of the graph
(Comp.), the size of the graph in memory (Raw Size),
the number of tiles, and the number of tasks. For all ex-
periments, we get the median of five runs with different
numbers of GPUs and report the best. For a graph, the
rate is defined as the ratio between the execution time and
the number of edges. Best rates are achieved on the DGX
machine. Table 4 also reports rates for each graph.
7.3 Sequential execution time for varying partitions
1 2 4 8 16 32 64
p
0
500
1000
1500
2000
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
Intersect HashMap Hybrid
0
20
40
60
80
100
*p c
c = dmaxd ′max
c = davgd ′avg
(a) Friendster
1 2 4 8 16 32 64
p
0
500
1000
1500
2000
2500
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
Intersect HashMap Hybrid
0
25
50
75
100
125
150
*p c
c = dmaxd ′max
c = davgd ′avg
(b) Twitter
Fig. 4. Sequential time of Friendster and Twitter graphs for different
p values. Black dashed line represents Latapy’s algorithm’s execution
time.
Figure 4 shows the sequential behavior of the BBTC
algorithm with respect to the different number of partitions
(p × p) on the Friendster and the Twitter graphs. Each
bar represents the execution time (left-axis) of the BBTC
algorithm using the list-based intersection, hashmap-based
intersection and hybrid intersection, for a given p (x-axis).
The black dashed line represents the best sequential time
that we get using Latapy’s algorithm [1]. We observe that
in both graphs when p is increased, sequential execution
time first decreases due to better memory utilization and
then increases because of higher load imbalance. Note that,
the BBTC algorithm runs in O(λpc m
3
2 ) time. To analyze the
behaviour of the BBTC algorithm, we have plotted λpc (right-
axis) values for c = dmaxd′max (star-shaped marker) and c =
davg
d′avg
(square-shaped marker) where dmax, davg are the maximum
and the average degrees in the given graph, and d′max, d
′
avg
are the maximum and the average degrees in the blocks of
the partitioned graph, respectively. We observe that trend in
execution times are more co-related with c = davgd′avg . Hence,
one should pick p around the average degree of the graph.
Note that, the BBTC algorithm outperforms Latapy’s algo-
rithm on both graphs for p values closer to davg.
7.4 Comparison with Latapy’s sequential algorithm
Given K , Latapy’s algorithm [1] applies hashmap-based
intersection for vertices whose degrees are higher than K ,
and list-based intersection for the others. We evaluate Lat-
apy’s algorithm’s [1] performance with respect to different
K values. Fig. 5(a) illustrates the performance profile of the
algorithm. In the performance profile, we plot the number
7. Graph Challenge Datasets: https://graphchallenge.mit.edu/
data-sets
9TABLE 4
Properties of the dataset. Best of the medians of execution times in seconds and corresponding rates are reported.
Data Set |V | |E| Triangle Count Comp. Raw Size Tiles Tasks Best (Copy Included)Time (s) Rate (×108)
cit-HepTh 27,770 352,285 1,478,735 0.6 2.2 MB 64 120 0.002 1.7
email-EuAll 265,214 364,481 267,313 0.5 9.5 MB 64 120 0.002 1.9
soc-Epinions1 75,879 405,740 1,624,481 0.6 3.9 MB 64 120 0.002 2.1
cit-HepPh 34,546 420,877 1,276,868 0.5 2.7 MB 64 120 0.002 2.0
soc-Slashdot0811 77,360 469,180 551,724 0.6 5.4 MB 144 364 0.002 2.9
soc-Slashdot0902 82,168 504,230 602,592 0.6 5.7 MB 144 364 0.002 3.1
flickrEdges 105,938 2,316,948 107,987,357 0.6 14 MB 144 364 0.016 1.5
amazon0312 400,727 2,349,869 3,686,467 0.7 28 MB 144 364 0.006 3.9
amazon0505 410,236 2,439,437 3,951,063 0.7 29 MB 144 364 0.007 3.7
amazon0601 403,394 2,443,408 3,986,507 0.7 28 MB 144 364 0.006 4.4
scale18 174,147 3,800,348 82,287,285 0.6 26 MB 256 816 0.021 1.8
scale19 335,318 7,729,675 186,288,972 0.6 56 MB 400 1540 0.041 1.9
as-Skitter 1,696,415 11,095,298 28,769,868 0.8 146 MB 400 1540 0.027 4.1
scale20 645,820 15,680,861 419,349,784 0.6 110 MB 400 1540 0.079 2.0
cit-Patents 3,774,768 16,518,947 7,515,023 0.5 352 MB 400 1540 0.038 4.4
scale21 1,243,072 31,731,650 935,100,883 0.6 216 MB 400 1540 0.144 2.2
soc-LiveJournal1 4,847,571 42,851,237 285,730,264 0.7 534 MB 400 1540 0.121 3.6
scale22 2,393,285 64,097,004 2,067,392,370 0.6 464 MB 576 2600 0.325 2.0
scale23 4,606,314 129,250,705 4,549,133,002 0.5 915 MB 576 2600 0.549 2.4
scale24 8,860,450 260,261,843 9,936,161,560 0.5 1.9 GB 784 4060 1.154 2.3
scale25 17,043,780 523,467,448 21,575,375,802 0.5 4.0 GB 1024 5984 2.400 2.2
twitter 61,578,414 1,202,513,046 34,824,916,864 0.5 13 GB 1296 8436 4.582 2.6
friendster 65,608,366 1,806,067,135 4,173,724,142 0.5 16 GB 1296 8436 3.133 5.8
wdc-2014 1,724,573,718 124,141,874,032 4,587,563,913,535 0.8 650 GB 1296 8436 116.5 12.4
1.0 1.5 2.0 2.5 3.0 3.5 4.0
Relative Performance w.r.t. Best Method
3
7
11
15
19
23
Nu
m
be
r o
f T
es
t I
ns
ta
nc
es
K=0
K=2
K=4
K=8
K=16
K=32
K=64
(a) Perf. Profile of Latapy’s Alg.
scale25 Twitter Friendster
Graphs
0
10
20
30
40
50
Ex
ec
ut
io
n 
Ti
m
e 
(m
in
)
Latapy
kkTri (LL)
kkTri (LU)
TCM
bbTC
(b) Sequential Execution Times
Fig. 5. Evaluation of Latapy’s algorithm wrt. K (5(a)) and comparison of
sequential execution times (5(b)).
of the test instances (y-axis) in which a K value obtains
an execution time on an instance that is no larger than x
times (x-axis) the best execution time achieved by any K
value for that instance [36]. Therefore, the higher a profile
at a given x value, the better a K value is. We observe that
switching between list-based and hashmap-based intersec-
tion techniques improves efficiency. In the majority of the
test instances, K = 2 performs better than others due to the
number of smaller graphs in our dataset. In larger instances,
such as Twitter and Friendster, K = 32 and K = 64 gives
the best performance. Hence, K should be picked based on
the graph size.
Fig. 5(b) illustrates sequential execution times of Lat-
apy’s algorithm, kkTri with LL, and LU formulations [30],
TCM [4], and BBTC on the three large graphs of our dataset;
scale25, Twitter and Friendster. BBTC outperforms other
algorithms in all graph instances due to better memory
utilization. TCM performs the worst because of the list-
based intersection. TCM achieves better scalability in paral-
lel settings with list-based intersection, because using dense
hashmaps (i.e. an array of the length n = |V |) causes
poor memory utilization. However, in a sequential set-
ting list-based intersection performs worse than hashmap-
based intersection because there is no bandwidth competi-
tion among running threads. kkTri uses a linked-list-based
hashmap for large graphs, which helps it to outperform
Latapy’s algorithm on the Friendster graph. We observe
that kkTri’s LU formulation outperforms LL formulation
in the sequential case while LL performs better in parallel
setting [18].
7.5 Task Workload Estimation
Fig. 6(a) and Fig 6(b) present execution times of the most
time consuming 32 tasks (sorted on the x-axis) on a CPU
and a GPU for Friendster and Twitter graphs. We observe
that a GPU executes heavy tasks at least seven times faster
than a CPU. Hence, assigning heavy tasks to GPUs is crucial.
Note that, BBTC doesn’t need to estimate workload of tasks
with full accuracy, just identifying heavy tasks is enough to
increase GPU utilization. Therefore, using a thorough model
to estimate the workload is an overkill. In this experiment,
10
Tasks0.0
0.5
1.0
1.5
2.0
2.5
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
CPU
GPU
(a) Friendster; heaviest tasks.
Tasks0.00
0.25
0.50
0.75
1.00
1.25
1.50
1.75
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
CPU
GPU
(b) Twitter; heaviest tasks.
0 8 16 24 32
Top Time Consuming Tasks in CPU
10 1
100
101
102
M
in
. N
u.
 o
f T
as
ks
 to
 C
ov
er
 A
ll 
(%
)(l
og
)
NNZ
Density
Degree
bbTC
(c) Friendster; workload estima-
tion
0 8 16 24 32
Top Time Consuming Tasks in CPU
10 1
100
101
102
M
in
. N
u.
 o
f T
as
ks
 to
 C
ov
er
 A
ll 
(%
)(l
og
)
NNZ
Density
Degree
bbTC
(d) Twitter; workload estimation
Fig. 6. Comparison of Different Estimation Functions
we evaluate the effectiveness of our proposed workload
estimation heuristic (see Sec. 5).
Fig. 6(c) and Fig 6(d) present the least required percent-
age of tasks (y-axis in log-scale) that starts from the begin-
ning of a sorted task list using an estimation function, to
cover most x time-consuming tasks (x-axis) (lower is better).
In this experiment, we compare four different estimation
functions. NNZ: the total number of nonzeros in a task.
Density: the sum of the average nonzeros per unit in the
blocks of a task. Degree: the sum of the average degree
of blocks in a task and BBTC’s estimation function that we
have described in Sec. 5. We observe that BBTC’s estimation
heuristic outperforms the others in 47 of 64 instances. In
the Friendster graph, NNZ outperforms BBTC’s estimation
heuristic in predicting the most time consuming 17 to 32
tasks. However, on the Twitter graph, NNZ performs poorly.
Hence, using NNZ as an estimation function may end up
with unstable performance. In both graphs, not surprisingly,
degree-based estimation performs closer to BBTC’s estima-
tion heuristic due to the effect of vertex degrees on common
neighbor computation.
7.6 Comparison with TriCore
1 2 3 4 5 6 7 8
Number of GPU(s)
0
2
4
6
8
10
12
14
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
TriCore
bbTC (GPU only)
bbTC
(a) Friendster
1 2 3 4 5 6 7 8
Number of GPU(s)
0
5
10
15
20
25
30
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
TriCore
bbTC (GPU only)
bbTC
(b) Twitter
Fig. 7. Comparison with TriCore
In this experiment, we compare BBTC with TriCore using
Friendster and Twitter graphs. Note that, partitioning is
turned off for TriCore as it affects the performance nega-
tively. For Twitter and Friendster graphs we can use TriCore
without partitioning because both graphs can fit into GPUs’
memory. However, when one needs to run larger problem
instances, partitioning is going to be necessary and will
hurt TriCore’s performance due to bad memory utilization.
Each bar in Fig. 7 illustrates the execution time (y-axis) of
an algorithm for different number of GPUs (x-axis). We
observe that BBTC scales better than TriCore with respect
to the number of GPUs thanks to its efforts to overlap com-
munication with computation. TriCore’s scalability becomes
worse after 4 GPUs on Friendster, and its times go up after
4 GPUs on Twitter. On the other hand, BBTC scales much
better up to 5 GPUs and continues to improve slightly,
after 5 GPUs. In Fig. 7, each error-bar presents variation
between minimum and maximum execution times among
five runs. We observe that the TriCore algorithm is highly
unstable, and its runtime deviates up to 40% while this
deviation is less than 5% for BBTC. TriCore’s high deviation
might be caused by unstable warp usage during intersection
computation.
Symmetric rectilinear partitioning may generate sparse
blocks due to the irregularity of the graphs and also a
generalization of the partitioning. Tasks that contain highly
sparse blocks might be executed extremely fast by CPUs and
GPUs. However, processing these tasks on GPUs, come with
an overhead caused by data copy, API calls, and stream syn-
chronization costs. Hence, assigning all tasks on GPUs may
end up with poor performance and also contradicts with
BBTC’s design goals. BBTC targets heterogeneous execution.
As shown in Fig. 7 because of the overhead of the light
tasks in 1 and 2 GPU-only cases, TriCore outperforms BBTC.
Hybrid execution addresses this issue and BBTC achieves
a competitive performance in these cases. Note that, even
in GPU only case, BBTC scales better than TriCore and
outperforms it after 4 GPUs in the system.
7.7 Overlapping communication and computation
1 2 3 4 5 6 7 8
Number of GPUs
0
2
4
6
8
10
12
14
16
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
Computation Time
Communication Time
bbTC (Comm. & Comp. w. Overlap)
(a) Friendster
1 2 3 4 5 6 7 8
Number of GPUs
0
5
10
15
20
25
30
Ex
ec
ut
io
n 
Ti
m
e 
(s
)
Computation Time
Communication Time
bbTC (Comm. & Comp. w. Overlap)
(b) Twitter
Fig. 8. Overlap comparison. Dashed line represents linear speedup.
Fig. 8 illustrates how successfully BBTC overlaps com-
munication with computation in the different number of
GPU settings. In Fig. 8, blue bars represent computation
only execution time (y-axis), and green bars represent com-
munication time between CPUs and GPUs. Orange bars
represent BBTC’s GPU only execution times, which tries
to overlap communication with computation. The black
dashed line represents the linearly scaled execution time
of single GPU computation-only time for the number of
11
GPUs. As expected, BBTC’s computation time scales almost
linearly with the number of GPUs. We observe that on the
Friendster graph, the BBTC algorithm can overlap commu-
nication with computation with just 10% overhead up to 5
GPUs, however this overhead increases for the number of
GPUs and becomes 50% when the number of GPUs is 8,
due to bandwidth limitations. This overlap varies from 1%
to 18% on the Twitter graph.
7.8 Hybrid execution and Cut-off
0 25 50 75 100
Cut-off (%)
1.0
1.5
2.0
2.5
Ra
te
 (x
10
8 ) friendster
twitter
scale25
(a) DGX (2 GPUs)
0 25 50 75 100
Cut-off (%)
1.0
1.5
2.0
2.5
Ra
te
 (x
10
8 ) friendster
twitter
scale25
(b) Newel (2 GPUs)
Fig. 9. Hybrid execution and cut-off.
In this experiment, we evaluate the effect of the cut-off
location (see Fig. 3) from 0 to |T |. If the cut-off is equal
to the number of tasks (|T |), we assign all the tasks to the
GPUs. If the cut-off is 0, then CPUs may execute all the
tasks. In this experiment, we set cut-off as a multiple of
one eight of the number of tasks ( |T |8 ), and report rates
for scale25, Twitter, and Friendster graphs on DGX and
Newell architectures. As illustrated in Fig. 9 when the cut-
off is larger than the last two quartiles, then the overall
performance decreases because the execution of the lighter
tasks on GPUs doesn’t compensate the overhead. The cut-
off creates a one-way barrier and guarantees the assignment
of heavy tasks on GPU by restricting CPUs for not passing
the cut-off point. Hence, the cut-off has crucial importance.
However, thanks to estimation function and relatively fewer
heavy tasks deciding the cut-off point is not a hard problem.
By default, the cut-off point is the middle of the execution
queue.
7.9 Relative speedup
Figure 10 presents relative speedup between this work and
two state-of-the-art algorithms; TCM [2], [4], and kkTri [30].
We compare BBTC with these two state-of-the-art works
for CPU only, and heterogeneous scenarios. In the CPU
only scenario, BBTC outperforms TCM in 23 of 23 cases
and kkTri in 20 of 23 cases. kkTri can perform better than
BBTC in three small instances; email-EuAll, amazon0505
and amazon0611. BBTC can achieve up to 18× and 14×
speedup with respect to TCM and kkTri, respectively. In the
heterogeneous case, BBTC outperforms TCM and kkTri in 23
of 23 cases. However, in four small instances (email-EuAll,
amazon0505, amazon0611, and cit-Patents), TCM and BBTC
are comparable. BBTC can achieve up to 18× speedup on
large graphs and up to 22× speedup on smaller instances.
ci
tH
pT
h
em
Eu
A
ll
Ep
ni
on
s
ci
tH
pP
h
Sl
hd
t8
1
Sl
hd
t9
2
flc
kr
ed
am
zn
32
am
zn
55
am
zn
61
sc
al
e1
8
sc
al
e1
9
as
Sk
tt
r
sc
al
e2
0
ct
Pt
nt
s
sc
al
e2
1
Lv
Jr
nl
1
sc
al
e2
2
sc
al
e2
3
sc
al
e2
4
sc
al
e2
5
tw
it
te
r
fr
nd
st
r0
5
10
15
20
R
el
at
iv
e 
Sp
ee
du
p
bbTC wrt. TCM
bbTC-CPU wrt. TCM
bbTC wrt. kkTri
bbTC-CPU wrt. kkTri
Fig. 10. Relative speedup between this work and TCM and kkTri algo-
rithms, on Newell architecture. bbTC-CPU: speedup when we only use
CPUs. bbTC: speedup when we use CPUs and GPUs. Black dashed
line represents the baseline. Graphs are sorted on x-axis based on their
number of edges.
1 2 3 4
Number of GPUs
1
2
3
4
Sp
ee
du
p
friendster (w NVLink)
friendster (w/o NVLink)
Fig. 11. Effect of bandwidth.
7.10 Effect of bandwidth on speedup
BBTC assigns four CUDA streams per GPU. Each of these
streams asynchronously copies the data from the host mem-
ory to the device memory. The bandwidth between the
host machine and the device becomes a bottleneck when
there are more GPUs on the node. In this experiment, we
evaluate how the bandwidth can affect the scalability. Fig. 11
illustrates BBTC’s strong scalability on Friendster graph up
to four GPUs on a machine with NVLink and a machine
without NVLink. As expected, with NVLink enabled archi-
tecture we get better scaling. Hence, if we would have a
server with NVLink and 8 Volta GPUs with, BBTC’s could
get even better execution times.
7.11 Comparison on the WDC-2014 graph
TABLE 5
Execution times of four different algorithms on two different
architectures on a WDC graph.
Haswell DGX
kkTri 423 Out of memory
TCM 476 Out of memory
TriCore N/A
bbTC 265 116
12
In this experiment, we evaluate the performance of the
BBTC algorithm on a WDC hyperlink graph. We present
runtimes in seconds in Table 5. This graph has 1.7 billion
vertices and 124 billion edges. DGX server doesn’t have
enough memory to execute this graph. Therefore, kkTri
and TCM algorithms failed on DGX. TriCore also raised
an exception during its partitioning stage; consequently, we
couldn’t run TriCore. To be able to process such a graph,
which is larger than available memory, BBTC uses memory-
mapped files. In CPU only execution (Haswell) BBTC out-
performs kkTri and TCM 1.6× and 1.8 times, respectively.
BBTC can process the WDC-2014 graph in 116 seconds
on DGX. Note that, on DGX a hardware RAID system is
installed. Hence, BBTC can perform faster disk I/O.
8 CONCLUSIONS
We have developed a triangle counting algorithm called
BBTC, which leverages a special 2D partitioning scheme
called symmetric rectilinear partitioning to define tasks that
can be easily used on any heterogeneous accelerator, CPU
combination. To leverage from the massive computing ca-
pabilities of the GPUs and to overlap communication with
computation, BBTC implements a dynamic task-scheduling
scheme that schedules tasks on multi-core-CPUs and mul-
tiple streams on multiple GPUs. BBTC is up to 20× faster
than state-of-the-art multi-core CPU algorithms and 1.5×
faster than state-of-the-art multi-GPU algorithm. Our exper-
imental results demonstrate that BBTC achieves the fastest
end-to-end execution times when the graph is not on the
GPU.
Acknowledgments: This work was supported in parts by
the NSF grants CCF-1919021 and Sandia National Lab-
oratories/Sandia Corp contract 2115504. Sandia National
Laboratories is a multi-mission laboratory managed and
operated by National Technology and Engineering Solutions
of Sandia, LLC., a wholly owned subsidiary of Honeywell
International, Inc., for the U.S. Department of Energy’s
National Nuclear Security Administration under contract
DE-NA-0003525.
REFERENCES
[1] M. Latapy, “Main-memory triangle computations for very large
(sparse (power-law)) graphs,” Theoretical Computer Science, pp.
458–473, 2008.
[2] J. Shun and K. Tangwongsan, “Multicore triangle computations
without tuning,” in 2015 IEEE 31st International Conference on Data
Engineering. IEEE, 2015, pp. 149–160.
[3] Y. Hu, H. Liu, and H. H. Huang, “Tricore: Parallel triangle count-
ing on gpus,” in SC18: International Conference for High Performance
Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 171–
182.
[4] L. Dhulipala, G. E. Blelloch, and J. Shun, “Theoretically efficient
parallel graph algorithms can be fast and scalable,” in Proceedings
of the 30th on Symposium on Parallelism in Algorithms and Architec-
tures. ACM, 2018, pp. 393–404.
[5] J. W. Berry, L. K. Fostvedt, D. J. Nordman, C. A. Phillips, C. Se-
shadhri, and A. G. Wilson, “Why do simple algorithms for triangle
enumeration work in the real world?” in Theoretical Computer
Science, 2014, pp. 225–234.
[6] A. Itai and M. Rodeh, “Finding a minimum circuit in a graph,”
SIAM Journal on Computing, vol. 7, no. 4, pp. 413–423, 1978.
[7] M. Ortmann and U. Brandes, “Triangle listing algorithms: Back
from the diversion,” in Proceedings of the Meeting on Algorithm
Engineering & Expermiments. SIAM, 2014, pp. 1–8.
[8] N. Alon, R. Yuster, and U. Zwick, “Finding and counting given
length cycles,” Algorithmica, vol. 17, no. 3, pp. 209–223, 1997.
[9] R. Pagh and F. Silvestri, “The input/output complexity of triangle
enumeration,” in Proceedings of the 33rd ACM SIGMOD-SIGACT-
SIGART symposium on Principles of database systems. ACM, 2014,
pp. 224–233.
[10] J. Cohen, “Trusses: Cohesive subgraphs for social network analy-
sis,” National security agency technical report, pp. 3–1, 2008.
[11] A. Prat-Pe´rez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-
Pey, “Shaping communities out of triangles,” in Proceedings of
the 21st ACM international conference on Information and knowledge
management. ACM, 2012, pp. 1677–1681.
[12] J.-P. Eckmann and E. Moses, “Curvature of co-links uncovers
hidden thematic layers in the world wide web,” Proceedings of the
national academy of sciences, vol. 99, no. 9, pp. 5825–5829, 2002.
[13] N. Wang, J. Zhang, K.-L. Tan, and A. K. Tung, “On triangulation-
based dense neighborhood graph discovery,” Proceedings of the
VLDB Endowment, pp. 58–68, 2010.
[14] C. E. Tsourakakis, P. Drineas, E. Michelakis, I. Koutis, and
C. Faloutsos, “Spectral counting of triangles via element-wise
sparsification and triangle-based link recommendation,” Social
Network Analysis and Mining, pp. 75–81, 2011.
[15] K. Date, K. Feng, R. Nagi, J. Xiong, N. S. Kim, and W.-M. Hwu,
“Collaborative (CPU + GPU) algorithms for triangle counting
and truss decomposition on the minsky architecture: Static graph
challenge: Subgraph isomorphism,” in 2017 IEEE High Performance
Extreme Computing Conference (HPEC). IEEE, 2017, pp. 1–7.
[16] A. Yas¸ar, S. Rajamanickam, J. W. Berry, M. M. Wolf, J. Young, and
U¨. V. C¸atalyu¨rek, “Linear algebra-based triangle counting via fine-
grained tasking on heterogeneous environments,” in High Perfor-
mance Extreme Computing Conference (HPEC), 2019 IEEE. IEEE,
2019.
[17] O. Green, P. Yalamanchili, and L.-M. Munguı´a, “Fast triangle
counting on the GPU,” in Proceedings of the 4th Workshop on
Irregular Applications: Architectures and Algorithms. IEEE Press,
2014, pp. 1–8.
[18] A. Yas¸ar, S. Rajamanickam, M. M. Wolf, J. W. Berry,
and U¨. V. C¸atalyu¨rek, “Fast triangle counting using cilk,”
in High Performance Extreme Computing Conference (HPEC),
2018 IEEE. IEEE, 2018, pp. 1–7. [Online]. Available: http:
//ieeexplore.ieee.org/document/8547563
[19] E. G. Boman, K. D. Devine, and S. Rajamanickam, “Scalable
matrix computations on large scale-free graphs using 2d graph
partitioning,” in SC’13: Proceedings of the International Conference on
High Performance Computing, Networking, Storage and Analysis, 2013,
pp. 1–12.
[20] E. Saule, E. O. Bas, and U¨. V. C¸atalyu¨rek, “Load-balancing
spatially located computations using rectangular partitions,”
Journal of Parallel and Distributed Computing, vol. 72, no. 10, pp.
1201–1214, 2012. [Online]. Available: http://dx.doi.org/10.1016/j.
jpdc.2012.05.013
[21] U¨. V. C¸atalyu¨rek, C. Aykanat, and B. Uc¸ar, “On two-dimensional
sparse matrix partitioning: Models, methods, and a recipe,” SIAM
Journal on Scientific Computing (SISC), vol. 32, no. 2, pp. 656–683,
2010. [Online]. Available: http://dx.doi.org/10.1137/080737770
[22] E. Kayaaslan, C. Aykanat, and B. Uc¸ar, “1.5D parallel sparse
matrix-vector multiply,” SIAM Journal on Scientific Computing,
vol. 40, no. 1, pp. C25–C46, 2018.
[23] S. Acer, O. Selvitopi, and C. Aykanat, “Optimizing nonzero-based
sparse matrix partitioning models via reducing latency,” Journal of
Parallel and Distributed Computing, vol. 122, pp. 145–158, 2018.
[24] M. Grigni and F. Manne, “On the complexity of the generalized
block distribution,” in International Workshop on Parallel Algorithms
for Irregularly Structured Problems. Springer, 1996, pp. 319–326.
[25] A. Yas¸ar and U¨. V. C¸atalyu¨rek, “Heuristics for symmetric rectilin-
ear matrix partitioning,” ArXiv, Tech. Rep. arXiv:1909.12209, Oct
2019. [Online]. Available: http://arxiv.org/abs/1909.12209
[26] D. P. O’leary and G. Stewart, “Data-flow algorithms for parallel
matrix computation,” Communications of the ACM, vol. 28, no. 8,
pp. 840–853, 1985.
[27] B. Hendrickson, R. Leland, and S. Plimpton, “An efficient parallel
algorithm for matrix-vector multiplication,” International Journal of
High Speed Computing, vol. 7, no. 01, pp. 73–88, 1995.
[28] E.-J. Im, K. Yelick, and R. Vuduc, “Sparsity: Optimization frame-
work for sparse matrix kernels,” The International Journal of High
Performance Computing Applications, vol. 18, no. 1, pp. 135–158,
2004.
13
[29] G. Teodoro, T. D. R. Hartley, U¨. V. C¸atalyu¨rek, and
R. Ferreira, “Run-time optimizations for replicated dataflows
on heterogeneous environments,” in Proc. of the 19th
ACM International Symposium on High Performance Distributed
Computing (HPDC), 2010, pp. 13–24. [Online]. Available: http:
//portal.acm.org/GraphBasedCitationAnalysis.cfm?id=1851479
[30] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Ra-
jamanickam, “Fast linear algebra-based triangle counting with
kokkoskernels,” in IEEE High Performance extreme Computing Con-
ference (HPEC). IEEE, 2017, pp. 1–7.
[31] M. Deveci, C. Trott, and S. Rajamanickam, “Performance-
portable sparse matrix-matrix multiplication for many-core archi-
tectures,” in Parallel and Distributed Processing Symposium Work-
shops (IPDPSW). IEEE, 2017, pp. 693–702.
[32] A. S. Tom and G. Karypis, “A 2D parallel triangle counting
algorithm for distributed-memory architectures,” in International
Conference on Parallel Processing. ACM, 2019, p. 45.
[33] S. Acer, A. Yas¸ar, S. Rajamanickam, M. M. Wolf, and U¨. V.
C¸atalyu¨rek, “Scalable triangle counting on distributed-memory
systems,” in High Performance Extreme Computing Conference
(HPEC), 2019 IEEE. IEEE, 2019.
[34] Y. Hu, H. Liu, and H. H. Huang, “High-performance triangle
counting on gpus,” in IEEE High Performance extreme Computing
Conference (HPEC). IEEE, 2018, pp. 1–5.
[35] T. A. Davis and Y. Hu, “The University of Florida sparse matrix
collection,” ACM Transactions on Mathematical Software (TOMS),
p. 1, 2011.
[36] E. D. Dolan and J. J. More´, “Benchmarking optimization software
with performance profiles,” Mathematical programming, vol. 91,
no. 2, pp. 201–213, 2002.
Abdurrahman Yas¸ar is a PhD Student in the School of Computational
Science and Engineering at Georgia Institute of Technology. He received
his M.S. in Computer Engineering from Bilkent University, Turkey in
2015.
Sivasankaran Rajamanickam (M14) is a Principal Member of Tech-
nical Staff in the Center for Computing Research at Sandia National
Laboratories. He earned his B.E. from Madurai Kamaraj University,
India, and his Ph.D. in Computer Engineering from University of Florida.
Jonathan Berry is a Distinguished Member of the Technical Staff at
Sandia National Laboratories. He holds a Ph.D. in computer science
from Rensselaer Polytechnic Institute and spent almost a decade in
liberal arts academia before joining Sandia in 2004.
U¨mit V. C¸atalyu¨rek (M09,SM10,Fellow16) is a Professor and Associate
Chair in the School of Computational Science and Engineering at the
Georgia Institute of Technology. He received his Ph.D., M.S. and B.S. in
Computer Engineering and Information Science from Bilkent University,
Turkey.
