SIMD-X: Programming and Processing of Graph Algorithms on GPUs by Liu, Hang & Huang, H. Howie
Hang Liu
University of Massachusetts Lowell
H. Howie Huang
The George Washington University
Abstract
With high computation power and memory bandwidth,
graphics processing units (GPUs) lend themselves to ac-
celerate data-intensive analytics, especially when such
applications fit the single instruction multiple data
(SIMD) model. However, graph algorithms such as
breadth-first search and k-core often fail to take full advantage
of GPUs, due to irregularity in memory access
and control flow. To address this challenge, we have
developed SIMD-X, for programming and processing
of single instruction multiple, complex, data on GPUs.
Specifically, the new Active-Compute-Combine (ACC)
model not only provides ease of programming to pro-
grammers, but more importantly creates opportunities
for system-level optimizations. To this end, SIMD-X
utilizes just-in-time task management which filters out
inactive vertices at runtime and intelligently maps various
tasks to different numbers of GPU cores in pursuit
of workload balancing. In addition, SIMD-X leverages
push-pull based kernel fusion that, with the help of a
new deadlock-free global barrier, reduces a large num-
ber of computation kernels to very few. Using SIMD-X,
a user can program a graph algorithm in tens of lines of
code, while achieving 3×, 6×, 24×, and 3× speedups over
Gunrock, Galois, CuSha, and Ligra, respectively.
1 Introduction
The advent of big data exacerbates the need to extract useful knowledge within an acceptable time envelope. For performance acceleration, many applications utilize graphics processing units (GPUs), whose huge success comes from exploiting the data-level parallelism in these applications. Implicitly, the traditional single instruction multiple data (SIMD) model of GPUs assumes regular programming and processing, that is, not only is the same instruction executed, but the same amount of work is also expected to be performed on each piece of data. Unfortunately, neither assumption holds true for many emerging irregular applications, especially graph analytics, which is the focus of this work. That is, such applications do not conform to the SIMD model: different amounts of work, or worse, completely different work, need to be performed on the data in parallel.
To enable graph computation on GPUs, this work ad-
vocates a new parallel framework, SIMD-X, for the pro-
gramming and processing of single instruction multiple,
complex, data on GPUs. At the heart of SIMD-X is
the decoupling of programming and processing, that is,
SIMD-X utilizes the data-parallel model for ease of expressing graph applications, while enabling system-level optimizations at run time to deal with the task-parallel complexity on GPUs. With SIMD-X, a programmer simply needs to define what to do on which data, without worrying about the issues arising from irregular memory access and control flow, both of which prevent GPUs from achieving massive parallelism.
SIMD-X consists of three major components: First,
SIMD-X utilizes a new Active-Compute-Combine (ACC)
programming model that asks a programmer to define three
data-parallel functions: the condition for determining an
active vertex, computation to be performed on an associ-
ated edge, and combining the updates from edge compute
to vertex state. As we will show later, ACC is able to sup-
port a large variety of graph algorithms from breadth-first
search, k-core, to belief propagation. While ACC adopts
the Bulk Synchronous Parallel (BSP) model, it differs
from traditional CPU-based graph abstractions such as
edge- or vertex-centric models in that ACC avoids atomic
operations and enables collaborative early termination (for
BFS) and fine-grained task management on GPUs.
Second, SIMD-X relies on just-in-time (JIT) task man-
agement to balance parallel workloads across different
GPU cores with minimal overhead. A good task list can
increase not only parallelism, but also sequential memory
access for the computation of the next iteration, both
of which are crucial for high-performance computing
on GPUs. To this end, we have designed a set of new
task management mechanisms of online and ballot filters,
arXiv:1812.04070v1 [cs.DC] 10 Dec 2018
[Figure 1 graphic: panels (a) Initialization, (b) Iteration 1, (c) Iteration 2, (d) Iteration 3, (e) Iteration 4, (f) Iteration 5. Each panel shows the weighted graph on vertices a to i together with the distance array indexed by vertex; the legend marks active and updated vertices.]
Figure 1: SSSP on a graph, with nine vertices {a, b, c, d, e, f, g, h, i} and ten undirected edges (with weights). SSSP iteratively
computes on the graph and generates the distance array. Particularly, heavy and light shadows represent active and most recently
updated vertices, respectively.
each of which excels in a complementary scenario, i.e., the former favors a small number of tasks while the latter favors a large number. At runtime, SIMD-X judiciously selects the more suitable filter to assemble the active work list for the next iteration. Our JIT task management can largely reduce memory consumption, thereby accommodating graphs much larger than those handled by prior work [38, 63].
Moreover, SIMD-X is able to deliver 16× speedup, on
average, across various algorithms and graphs.
Third, SIMD-X designs a new technique of push-pull
based kernel fusion which aims to further accelerate
graph computing by reducing kernel invocation overhead
and global memory traffic. SIMD-X addresses the deadlock issue that occurs in existing work such as Gunrock [61], CuSha [25], and StreamScan [67] when fusing kernels across the software global barrier [65]. Moreover, instead of aggressively fusing the algorithm into
one giant kernel, SIMD-X fuses the kernels around the
pull and push stages within each computation to mini-
mize both register consumption and kernel relaunching.
The evaluation shows that the new fusion technique can
reduce the register consumption by half and thus dou-
ble the configurable thread count, leading to 42% and
25% performance improvement over non-fused and ag-
gressive fusion, respectively.
SIMD-X is different from prior work in several as-
pects. First, in order to use existing systems efficiently,
a programmer needs to possess an in-depth knowledge
of GPU architecture [12, 1], e.g., Gunrock requires ex-
plicit management of GPU threads and memory [63],
and B40C [38] and Enterprise [29] need thousands of
lines of CUDA code for BFS specific optimizations. One
of the goals of this work is to provide a simple pro-
gramming model and delegate the responsibility of task
management to SIMD-X. Second, current systems ei-
ther ignore workload imbalance as in [25, 76], or re-
solve it reactively as in [63, 59], both of which result in
undesired system performance. Lastly, because GPUs
lack support for global synchronization, existing sys-
tems [60, 63, 29, 31, 56] either rely on the multi-kernel
design or runtime tuning, both of which come with considerable
overhead, especially for graph algorithms with
high iteration count. SIMD-X addresses these challenges
with the help of new filters, as well as a deadlock-free
software barrier.
The rest of this paper is organized as follows: Sec-
tion 2 presents the challenges of constructing SIMD-X on
GPUs. Section 3 describes the ACC model. Section 4
presents our just-in-time management approach and Section 5
discusses the kernel fusion design. We present the
graph algorithms in Section 6 and the evaluation results
in Section 7. Section 8 discusses the related work and
Section 9 concludes.
2 SIMD-X Challenges and Architecture
2.1 Graph Computing on GPUs
Generally speaking, regular applications present uni-
form workload distribution across the data set. As a
result, such applications lend themselves to the data-
parallel GPU architecture. For development and eval-
uation, this work mainly uses NVIDIA GPUs, which
have tens of streaming processors and in total thou-
sands of Compute Unified Device Architecture (CUDA)
cores [1, 44]. Typically, a warp of 32 threads executes the same instruction in parallel on consecutive data. For regular applications, e.g., dense matrix algebra as shown in Figure 2(a), programming and processing are simple.
On the other hand, task management for irregular ap-
plications is challenging on GPUs. In this work, we fo-
cus on a number of graph algorithms such as breadth-first
search, k-core, and belief propagation. Here we use one
algorithm, Single Source Shortest Path (SSSP), to illustrate
the challenges. Simply put, a graph algorithm
computes on a graph G = (V, E, w), where V, E, and w
are the sets of vertices, edges, and edge weights. The
computation updates the algorithmic metadata which are
the states of vertices or edges in an iterative manner. A
typical workflow of SSSP is shown in Figure 1. Initially,
SSSP assigns the infinite distance to each vertex in the
distance array, which is represented as blank in the fig-
ure. Assuming the source vertex is a, the algorithm as-
signs 0 as its initial distance, and now vertex a becomes
active. Next, SSSP computes on this vertex, that is, cal-
culating the updates for all the neighbors of vertex a. In
[Figure 2 graphic: (a) a regular application, in which a dense matrix maps uniformly onto the GPU; (b) an irregular application, in which a graph algorithm on the example graph maps unevenly onto the GPU.]
Figure 2: Mapping regular versus irregular applications on
GPUs
this case, vertices {b, d} have their distances updated
to 5 and 1 in the distance array. At the next iteration,
the vertices with newly updated distances become active
and perform the same computation again. This process
continues until no vertex gets updated. Different from
breadth-first search, SSSP may update the distances of
some vertices across multiple iterations, e.g., vertex b is
updated in iterations 1 and 3.
In this example, not every vertex is active at all times, and vertices with different degrees (number of edges) yield varying amounts of work. For instance, at iteration 3 in Figure 1(d), one thread working on vertex c computes two neighbors, while another thread on vertex e computes four. As a result, a complex mapping as shown in Figure 2(b) is required for high-performance processing, and achieving it requires in-depth knowledge of GPUs from the programmer.
2.2 Architecture
SIMD-X is motivated to achieve two goals simultaneously: providing ease of programming for a large variety of graph algorithms, while enabling fine-grained optimization of GPU resources at runtime. Figure 3 presents an overview of the SIMD-X architecture. To achieve
the first goal, SIMD-X utilizes a simple yet powerful
Active-Compute-Combine (ACC) model. This data-
parallel API allows a programmer to implement graph
algorithms with tens of lines of code (LOC). Prior work
requires significant programming effort [38, 29, 63], or
runs the risk of poor performance [25].
In SIMD-X, high-performance graph processing on
GPUs is achieved through the development of two com-
ponents: (1) JIT task management, which is responsi-
ble for translating data-parallel code to parallel tasks on
GPUs. Essentially, SIMD-X “filters” the inactive tasks
and groups similar ones to run on the underlying SIMD
architecture. In particular, SIMD-X develops online and
ballot filters for handling different types of tasks, and dy-
namically selects the better filter during the execution of
the algorithm. (2) Push-pull based kernel fusion.
Graph applications are iterative in nature and thus require
synchronization. Fusing kernels across iterations would
yield substantial benefits, because kernel launching at
[Figure 3 graphic: graph algorithms (BFS, BP, k-Core, PageRank, SpMV, SSSP, ...) sit on top of SIMD-X, which comprises the ACC programming model, just-in-time task management (online filter, ballot filter, JIT control), and push-pull based kernel fusion (selective kernel fusion, deadlock-free software global barrier), all running on the GPU.]
Figure 3: SIMD-X architecture
each iteration incurs non-trivial overhead. In SIMD-X,
we observe that with aggressive kernel fusion, register
consumption would increase dramatically, lowering the
occupancy and thus performance. To this end, SIMD-
X deploys kernel fusion around pull and push stages of
each graph computation, seeking a sweet spot that not
only maximizes the range of each kernel fusion but also
minimizes the register consumption. It is worth noting
that we also address the deadlock issue faced by the software
global barrier in SIMD-X.
3 ACC Programming Model
When it comes to graph computing, there are two main
programming models: vertex-centric vs. edge-centric.
“Think like a vertex” [37, 75] focuses on tasks on active
vertices in a graph, whereas “think like an edge” [49, 48]
iterates on edges and simplifies programming. SIMD-X
aims to achieve the dual goal of ease of programming, as in the
edge-centric model, and efficient workload scheduling, as in the
vertex-centric model.
3.1 Motivation
Graph programming converges to either vertex-centric
or edge-centric models. In particular, the vertex-centric
model contains two functions: vertex scatter defines
what operations should be done on this vertex, and
vertex gather applies the updates on the vertex. This
model has been adopted by a number of existing projects,
e.g., Pregel [37], GraphLab [33], PowerGraph [14],
GraphChi [28], FlashGraph [75], Mosaic [35], and GridGraph [77], as well as GPU-based implementations such as CuSha [25] and Gunrock [63]. On the other hand, the edge-centric model was initially introduced by the external-memory graph engine X-Stream [49] to improve
IO performance. It requires a programmer to define
two functions needed on each edge, edge scatter and
edge gather. As such, this model schedules threads by
the edge count. Particularly, one thread needs to send the
information of the source vertex and the outbound edge
to the destination vertex (edge scatter), which atomi-
cally applies the new updates in edge gather.
Table 1: Comparison between ACC and relevant GPU-based programming models.
Abstraction   Related work                   Stages (task filtering, workload balancing, avoiding atomic operations)   Graph format
ICU           CuSha [25], Lux [22]           Init/Compute (Edge) Update                                                Edge list
ICRU          WS [24]                        Init/Compute (Edge) Reduce/IsUpdate                                       CSR
AFC           Gunrock [63]                   Advance/Filter Compute (Vertex, with atomic update)                       CSR
GAS           GTS [26], GraphReduce [50]     Gather (Edge) Apply/Scatter                                               Edge list
ACC           SIMD-X                         Active Compute (Edge) Combine                                             CSR

In this work, we believe the many-threaded nature of GPU architecture demands a new abstraction. We intend to exploit various thread scheduling options to better tackle workload imbalance [29, 63], while minimizing the overhead with regard to atomic operations on GPUs [34]. Table 1 summarizes the designs of recent
GPU-based graph analytics systems. To avoid wasting
the threads to compute on inactive vertices, task filter-
ing is essential in generating a list of active vertices.
Once task lists are ready, workload imbalance caused by
skewed degree distribution in many graphs becomes the
next concern. Since handling this issue in a vertex-centric model involves nontrivial programming effort [29], edge-based computing presents a desirable alternative. However, the traditional edge-centric approach results in atomic updates at the destination vertex, so a proper schedule before applying the updates is essential to avoid atomic operations. It is also important to note that compressed sparse row (CSR) is a preferable graph format because it saves around 50% of the space compared with the edge list format, which matters as contemporary GPUs only feature tens of GB of memory [1]. The proposed ACC framework is designed
to address these three challenges.
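The space argument can be checked with a small sketch. It assumes an unweighted graph stored as directed (source, destination) pairs and counts storage in machine words; the helper names `edge_list_words`, `csr_words`, and `build_csr` are illustrative and not part of SIMD-X.

```python
def edge_list_words(num_edges):
    """Edge list: every directed edge stores a (source, destination) pair."""
    return 2 * num_edges


def csr_words(num_vertices, num_edges):
    """CSR: |V| + 1 row offsets plus one destination ID per directed edge."""
    return (num_vertices + 1) + num_edges


def build_csr(num_vertices, edges):
    """Build CSR (offsets, destinations) from directed (src, dst) pairs."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1            # count the out-degree of each vertex
    for i in range(num_vertices):
        offsets[i + 1] += offsets[i]     # prefix sum turns counts into offsets
    dests = [0] * len(edges)
    cursor = offsets[:-1].copy()         # next free slot in each row
    for src, dst in edges:
        dests[cursor[src]] = dst
        cursor[src] += 1
    return offsets, dests
```

For the nine-vertex example graph (20 directed edges) the saving is only 25% (30 versus 40 words); the ratio approaches 50% as the edge count grows much larger than the vertex count, which is the regime of large social and web graphs.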
3.2 ACC Model
The new ACC model contains three functions: Active,
Compute, and Combine. ACC supports a wide range of
graph algorithms and requires far fewer lines of code compared to prior work. In the following, we discuss the three functions.
Active allows a programmer to specify the condition under which a vertex is active. Formally, it can be defined as:
$\exists v \leftarrow \mathrm{active}(M_v, v)$
where $v$ is the vertex ID and $M_v$ represents its metadata. Depending on the algorithm, the Active function may vary. Belief propagation (BP) is simple in that it treats all vertices as active. In comparison, SSSP, as shown in Figure 4(a), considers a vertex active when its current metadata differs from that of the prior iteration.
Simply put, SIMD-X distinguishes active vertices from
inactive ones, and focuses on the calculation needed for
each vertex. This is different from the vertex-centric
model which deals with not only the active vertex but
also its neighbors. Because two vertices may have dif-
ferent numbers of neighbors, existing systems [37, 14]
likely suffer from workload imbalance. To this end,
SIMD-X leverages a classification technique, similar to
Enterprise [29], to group the active vertices depending
on the expected workload.
Compute defines the computation that happens on each
edge. In particular, it specifies the operations on the
metadata of edge (v, u) and two vertices v and u, which
can be written as follows:
$\mathrm{update}_{v \to u} \leftarrow \mathrm{compute}(M_v, M_{(v,u)}, M_u)$
where the return value $\mathrm{update}_{v \to u}$ will be used by the
Combine function. For example, SSSP computes the up-
dated distance for the destination vertex as shown in Fig-
ure 4(a).
Combine merges all the updates once the computations are completed. It can be represented as:
$\mathrm{update}_u \leftarrow \bigoplus_{v \in \mathrm{Nbr}[u]} \mathrm{update}_{v \to u}$
where $\oplus$ must be commutative and associative, e.g., sum and minimum, and is applied over all the neighbors of vertex $u$. Figure 4(a) presents the Combine example for SSSP. In particular, BP sums all updates, whereas SSSP combines all updates from Compute by selecting the minimum.
SIMD-X optimizes two types of combine operations,
i.e., aggregation and voting. Particularly, aggregation
cannot tolerate overwrites, that is, all updates are needed
for computing the results. PageRank, SSSP and k-Core
are representative examples of such operation. In con-
trast, voting relaxes this condition, that is, the algorithm
is correct as long as one update is received because all
updates are identical. BFS, weakly connected compo-
nent and strongly connected component algorithms [54]
fall into this category.
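The three functions can be exercised with a minimal sequential sketch of SSSP. It is not the SIMD-X implementation: it runs on the CPU, pulls updates into each destination vertex, and the edge weights in `EDGES` are an assumption chosen to be consistent with the distance arrays described for Figure 1.

```python
import math

VERTICES = "abcdefghi"
# Undirected weighted edges; weights are assumed, not taken from the paper.
EDGES = [("a", "b", 5), ("a", "d", 1), ("b", "c", 1), ("b", "e", 1),
         ("d", "e", 2), ("c", "f", 2), ("e", "f", 1), ("e", "g", 3),
         ("e", "h", 4), ("e", "i", 6)]


def neighbors():
    """Map every vertex to its list of (neighbor, weight) pairs."""
    nbr = {v: [] for v in VERTICES}
    for s, d, w in EDGES:
        nbr[s].append((d, w))
        nbr[d].append((s, w))
    return nbr


def active(curr, prev, v):
    """Active: metadata changed during the previous iteration."""
    return curr[v] != prev[v]


def compute(m_v, w, m_u):
    """Compute: candidate distance of destination u through edge (v, u)."""
    return min(m_u, m_v + w)


def combine(updates):
    """Combine: minimum, which is commutative and associative."""
    return min(updates)


def sssp(src):
    nbr = neighbors()
    prev = {v: math.inf for v in VERTICES}
    curr = dict(prev)
    curr[src] = 0
    iterations = 0
    while True:
        frontier = {v for v in VERTICES if active(curr, prev, v)}
        if not frontier:
            break
        nxt = dict(curr)
        for u in VERTICES:
            # Gather one update per active neighbor, then combine them once,
            # so each vertex has a single writer and needs no atomics.
            updates = [compute(curr[v], w, curr[u])
                       for v, w in nbr[u] if v in frontier]
            if updates:
                nxt[u] = combine(updates)
        prev, curr = curr, nxt
        iterations += 1
    return [curr[v] for v in VERTICES], iterations
```

Calling `sssp("a")` on this assumed graph converges after five iterations, the last of which changes nothing, mirroring the five iteration panels of Figure 1.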
3.3 Processing with ACC
This section uses SSSP as an example to illustrate how
the SIMD-X framework works. SSSP computes the short-
est paths between the source vertex and the remaining
vertices of the graph. Although similar to Breadth-First
Search (BFS), SSSP is more challenging as only one ver-
tex with the shortest distance should be computed at one
time. To improve the parallelism, we adopt the delta-
step [39] algorithm which permits us to simultaneously
compute a collection of the vertices whose distances are
relatively shorter. We assume positive edge weights.
(a) SSSP in ACC
 1: Init (vertex src) {
 2:     metadata_curr[src] = 0;
 3:     large_list.insert(src);
    }
 4: Active (vertex v) {
 5:     return metadata_curr[v] != metadata_prev[v];
    }
 6: Compute (edge e, weight w) {
 7:     old_dist = metadata_curr[e.dest];
 8:     new_dist = metadata_curr[e.src] + w;
 9:     return old_dist > new_dist ? new_dist : old_dist;
    }
10: Combine (metadata_t *A) {
11:     return min(A);
    }

(b) ACC in SIMD-X
 1: Warp (Compute, Combine, Active, overflow)
 2:     for each active vertex v in med_list:   // warp in parallel
 3:         for each neighboring edge set e[32] to vertex v:
 4:             res[lane_id] = Combine(Compute(e[lane_id]));
 5:         final = Combine(res);
 6:         if lane_id == 0:
 7:             metadata_curr[v] = final;
 8:             small_bin, med_bin, large_bin = online_filter(Active, v, overflow);

10: Thread()   // Similar to Warp, but with one thread working on one active vertex
11: CTA()      // Similar to Warp, but with one CTA working on one active vertex

    SSSP_main {
        Init(src);
12:     while condition:
13:         Thread(Compute, Combine, Active, overflow)
14:         Warp(Compute, Combine, Active, overflow)
15:         CTA(Compute, Combine, Active, overflow)
16:         __software_global_barrier();
17:         if (overflow):
18:             ballot_filter(Active);
19:         else:
20:             small_list, med_list, large_list = prefix_scan(small_bin, med_bin, large_bin);
21:         __software_global_barrier();
    }
Figure 4: (a) Single-Source Shortest Path expressed in ACC
model and (b) ACC abstraction within SIMD-X framework.
As shown in lines 12-21 of Figure 4(b), SIMD-X structures graph computation as a loop. Similar to popular GPU-based frameworks [63, 25, 24], ACC follows the BSP model, that is, synchronization is required at the end of each iteration. As we will discuss in the next section, SIMD-X employs three kernels to balance the workload: the Thread, Warp, and CTA kernels work on small_list, med_list, and large_list, respectively. During computation, the online filter (Section 4) attempts to track the active vertices with the thread bins (i.e., small_bin, med_bin, and large_bin). Note that each active vertex is stored in one of these three bins based upon its degree. After a deadlock-free software global barrier (Section 5), SIMD-X checks
[Figure 5 plot: y-axis Speedup (0.9 to 1.4); x-axis graphs FB, ER, KR, LJ, OR, PK, RD, RC, RM, UK, TW, and Avg.; series Vote and Aggregation.]
Figure 5: Speedup of our ACC model over Gunrock. Note vote
and aggregation operations are materialized by BFS and SSSP
algorithms, respectively.
whether an overflow happens in any of the thread bins, which leads to either a ballot filter-based generation of the active lists or a simple prefix-scan based concatenation of all thread bins to produce the active lists (lines 17-21).
In Figure 4(b), lines 1-8 exemplify the interactions between ACC and SIMD-X. First, SIMD-X schedules a warp of threads to work on the neighbors of one active vertex from med_list. Similarly, Thread and CTA schedule a thread and a CTA to work on each active vertex from small_list and large_list, respectively. During computation, each thread conducts a local Compute and Combine at line 4. Once finished, a cross-warp Combine happens at line 5. Eventually, the first thread of the warp applies the final update (without atomic operations) and stores this vertex (if active) into the corresponding thread bin.
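The compute-then-combine pattern of the Warp kernel can be mimicked on the CPU. The sketch below is a sequential simulation under stated assumptions: the 32 lanes are emulated by a Python loop, the combine operator is the minimum (as in SSSP), and `warp_process_vertex` is an illustrative name rather than a SIMD-X function.

```python
WARP_SIZE = 32


def warp_process_vertex(v, neighbors, dist):
    """Simulate one warp updating vertex v from (neighbor, weight) pairs.

    Each lane accumulates a private partial result (local Compute and
    Combine), the partials are merged across lanes (cross-warp Combine),
    and a single writer stores the result, so no atomic update is needed.
    Returns True when the metadata of v changed.
    """
    res = [dist[v]] * WARP_SIZE                        # one partial per lane
    for base in range(0, len(neighbors), WARP_SIZE):   # 32 edges per step
        chunk = neighbors[base:base + WARP_SIZE]
        for lane_id, (u, w) in enumerate(chunk):       # lockstep on a GPU
            res[lane_id] = min(res[lane_id], dist[u] + w)
    final = min(res)                                   # cross-lane Combine
    changed = final != dist[v]
    dist[v] = final                                    # lane 0 is the only writer
    return changed
```

The extra cost relative to atomic updates is the cross-lane merge of `res`; the benefit is that the write to `dist[v]` happens exactly once per vertex.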
Comparison. The new ACC model follows a compute-then-combine approach, which pays extra overhead (i.e., assembling all updates residing in shared memory from participating threads) in order to achieve the benefits of atomic-free updates. Gunrock, in contrast, directly applies computation updates to vertex status with atomic operations, thereby avoiding inter-thread communication but experiencing heavier overhead from atomic operations. Figure 5 studies the performance impact of ACC vs. Gunrock. One can see that ACC is, on average, 12% and 9% faster on vote and aggregation operations, respectively. For vote, the speedup comes from the fact that ACC can schedule all threads to collaboratively determine early termination, which is not possible in Gunrock. For aggregation, the performance gain comes from the elimination of atomic updates.
4 Just-In-Time Task Management
Task management is essential for graph applications.
The key to success is to ensure good workload balance on
GPUs, that is, each GPU core, regardless of which streaming processor it belongs to, accounts for a similar amount of work. To this end, we adopt a two-step approach: task management and thread assignment. In step I, task management, the tasks are classified into various lists, namely small_list, med_list, and large_list. In step II, thread assignment, various granularities of GPU threads
[Figure 6 graphic: (a) Batch filter: (a1) batch the edges of the active vertices into an active edge list, (a2) update vertex status, (a3) record updated vertices in the thread bins. (b) Ballot filter: (b1) update vertex status, (b2) adjacent scan of the updated metadata with the ballot vote result sent to one thread, (b3) record updated vertices. (c) Online filter: (c1) update vertex status, (c2) record updated vertices. Each panel shows the current active list, the neighbor list, the updated metadata for vertices a to i, the thread bins of threads 0 and 1, and the next active list, annotated as sorted, unsorted, or unsorted and redundant.]
Figure 6: Three task management methods. Particularly, batch filter and ballot filter work on Figure 1(d) to produce a task list for
next iteration. Online filter does that for Figure 1(c).
are scheduled to work on different worklists, that is, a single thread per small task, a warp per medium task, and a CTA per large task. Unlike prior work [29, 63, 38], SIMD-X focuses on improving the first step, as it is the major culprit that offsets the benefits of workload balancing. In the following, we first analyze the drawbacks of the existing batch filter method, then describe our two new filters, as well as the JIT selection mechanism.
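The two-step approach can be sketched as follows. The degree cut-offs `small_max` and `large_min` are hypothetical defaults chosen for illustration, since this passage does not state the actual thresholds; the per-task thread counts (1, 32, and 256) follow Figure 7.

```python
def classify(active_vertices, degree, small_max=32, large_min=256):
    """Step I: split active vertices into small, medium, and large lists.

    small_max and large_min are assumed cut-offs, not values from the paper.
    """
    small_list, med_list, large_list = [], [], []
    for v in active_vertices:
        d = degree[v]
        if d < small_max:
            small_list.append(v)
        elif d < large_min:
            med_list.append(v)
        else:
            large_list.append(v)
    return small_list, med_list, large_list


def threads_launched(small_list, med_list, large_list, warp=32, cta=256):
    """Step II: one thread per small task, a warp per medium, a CTA per large."""
    return len(small_list) + warp * len(med_list) + cta * len(large_list)
```

With such a split, a low-degree vertex never occupies a whole warp, and a high-degree vertex is never serialized on a single thread.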
Drawback of batch filter. This approach [63, 38, 10] first loads all the edges of the active vertices to construct an active edge list. Still using the example of SSSP in Figure 1(d), this step loads the neighbors of vertices {e, c} and constructs the active edge list in a1 of Figure 6(a). Next, the batch filter checks these edges and updates vertex metadata (a2), followed by recording the updated vertices in the thread bins at step a3. Eventually, the batch filter concatenates these thread bins to arrive at a potentially unsorted and redundant next active list: {b, f, h, f, g, i}. Note that thread-private local storage, the thread bin, is used to avoid expensive atomic operations, because multiple threads would otherwise need atomic operations to put active vertices directly into the next active list.
We observe several drawbacks when using the batch filter for various graph algorithms. First, the active list can consume up to 2·|E| memory space because the majority of the vertices in a graph can become active in one iteration [4, 29], which is especially true for popular social and web graphs. Considering that a GPU has very limited on-board memory (e.g., 16 GB), this restriction makes large-scale GPU-based graph computing intractable. Second, the batch filter produces a worklist with unsorted and redundant active vertices, e.g., the next active list {b, f, h, f, g, i} of Figure 6(a), which leads to poor memory performance in the next iteration's computation.
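A small simulation illustrates both drawbacks on the iteration where {e, c} are active. It is a sketch under assumptions: the edge weights in `NBR` are chosen to be consistent with Figure 1, two threads take edges in a strided fashion, and each thread compares against the metadata snapshot taken before the step. The exact order of the resulting list depends on this assumed edge-to-thread assignment.

```python
import math

# Neighbor lists with assumed weights, in the CSR order shown in Figure 6.
NBR = {"a": [("b", 5), ("d", 1)],
       "b": [("a", 5), ("c", 1), ("e", 1)],
       "c": [("b", 1), ("f", 2)],
       "d": [("a", 1), ("e", 2)],
       "e": [("b", 1), ("d", 2), ("f", 1), ("g", 3), ("h", 4), ("i", 6)],
       "f": [("e", 1), ("c", 2)],
       "g": [("e", 3)], "h": [("e", 4)], "i": [("e", 6)]}


def batch_filter(active, dist, num_threads=2):
    # Step a1: materialize every edge of every active vertex.
    edge_list = [(v, u, w) for v in active for u, w in NBR[v]]
    new_dist = dict(dist)
    bins = [[] for _ in range(num_threads)]
    # Steps a2 and a3: thread t handles edges t, t + num_threads, ...
    for t in range(num_threads):
        for v, u, w in edge_list[t::num_threads]:
            cand = dist[v] + w
            if cand < dist[u]:                  # checked against the snapshot
                new_dist[u] = min(new_dist[u], cand)
                bins[t].append(u)               # thread-private, no atomics
    next_active = [u for b in bins for u in b]  # concatenate the thread bins
    return edge_list, next_active, new_dist
```

Starting from the distances 0, 5, 6, 1, 3 for a to e, the active edge list holds eight entries for only two active vertices, and vertex f is recorded by both threads, so the next active list is neither sorted nor duplicate-free.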
Ballot filter is designed to overcome these shortcomings. It first loads the neighbors of active vertices and immediately updates vertex metadata. As shown at step b1 in Figure 6(b), the neighbors of {e, c} get updated immediately. Afterwards, threads 0 and 1 (red and blue lines) exploit a ballot scan to inspect the updated metadata and record the updated vertices in their local thread bins at step b3. The final step is similar to that of the batch filter: we concatenate these two thread bins to arrive at the next active list, but with sorted and unique active vertices.
Ballot scan is the key to understanding why we arrive at a better next active list. In steps b2 and b3 of Figure 6(b), threads 0 and 1 perform a coalesced scan of vertex metadata and, with the CUDA __ballot() primitive, return a bit variable ‘01’ to the first thread. Here 1 means active and 0 otherwise; in this case, vertex a is not active while b is. By collaboratively working through the entire metadata array, the first thread eventually gets the bit value ‘0100’ for the first four vertices, while the second thread gets ‘011110’ for the remaining six vertices. Consequently, this approach produces a sorted active list, that is, {b, f, g, h, i} in b3.
We intentionally schedule threads 0 and 1 to collaboratively scan the metadata in order to achieve coalesced memory access during the scan, as well as to make threads 0 and 1 account for a continuous range of vertices, that is, vertices a - d for thread 0 and e - i for thread 1. This achieves dual benefits: a coalesced scan and sorted active vertices in the next active list. Last but not least, this scheduling makes the ballot filter safe for many threads.
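The ballot scan can be imitated sequentially. In this sketch, `ballot` stands in for the CUDA warp vote primitive by packing one predicate per lane into an integer bitmask; the group width of four lanes and the function names are assumptions made for a small, readable example.

```python
def ballot(predicates):
    """Pack one boolean per lane into a bitmask (bit i belongs to lane i)."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def ballot_filter(curr, prev, group_size=4):
    """Scan metadata in contiguous groups and record the changed vertices.

    Each group covers a contiguous vertex range, so concatenating the
    per-group bins yields a sorted, duplicate-free active list.
    """
    bins = []
    for base in range(0, len(curr), group_size):
        end = min(base + group_size, len(curr))
        mask = ballot([curr[i] != prev[i] for i in range(base, end)])
        # One designated writer per group expands the bitmask into vertex IDs.
        bins.append([base + lane for lane in range(end - base)
                     if (mask >> lane) & 1])
    return [v for b in bins for v in b]
```

Because vertex IDs are derived from bit positions rather than from the order in which updates happened, the output order is fixed by the scan, and the cost is proportional to the size of the metadata array rather than to the number of active vertices.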
We also notice an unpublished parallel effort from Khorasani’s dissertation [23] which is closely related to the ballot filter. However, his design relies on atomic operations to compute the offsets of active vertices from each warp in the next active list and subsequently assigns merely a single thread from the warp to enqueue all these active vertices. This design implies twin disadvantages compared to ours. First, atomic operation-based offset computation cannot yield sorted active lists. Second, single thread-based recording of active vertices tends to be slower than the warp-based recording in our design.
[Figure 7 graphic: step I, JIT task management: the online filter runs first; if an overflow occurs, the ballot filter is used; the result is Small_list, Med_list, and Large_list. Step II, workload balance: Thread (1 thread), Warp (32 threads), and CTA (256 threads) kernels process the three lists; then Iteration++.]
Figure 7: Just-in-time task management.
[Figure 8 plot: per-iteration filter choice (online filter or ballot filter) for BFS, k-Core, and SSSP on graphs FB, ER, KR, LJ, OR, PK, RD, RC, RM, UK, and TW; x-axis: Iteration.]
Figure 8: Ballot filter activation patterns.
Ballot filter is not without its own issue, especially when the number of active vertices is low. In that case, scanning the metadata array accounts for the majority of the runtime. For instance, in the ER and RC graphs, 99.23% and 96.67% of the time is spent on scanning metadata in a ballot-filter-only solution, respectively.
Online filter is designed to address the issue faced by the ballot filter. In the first step, this method loads the active neighbors, updates the destination vertex, and simultaneously records the active vertices in the thread bin. In the last step, it assembles all thread bins together as the next active list. When the number of active vertices is small, this approach turns out to be extremely fast. Here we use the early stage of SSSP as an example to explain its working process. As shown in Figure 6(c), {b, d} are the active vertices; this approach loads their neighbors for computation (c1) and immediately records the destination vertices. Eventually, it generates {e, c} as the active list for the next iteration, as shown in c2. It is also important to note that for the online filter, the vertices in the active
[Figure 9 plots: (a) Relative performance. (b) JIT overhead.]
Figure 9: (a) The relative performance of JIT management with
respect to various online filter overflow thresholds on BFS
and (b) the overhead of JIT on SSSP.
list may become redundant and out of order.
In graph computing, it is possible that one GPU thread encounters an excessive number of active vertices, e.g., our tests on the Twitter graph show that one GPU thread can collect more than 4,096 active vertices. Clearly, one cannot afford such a large thread bin for all threads, thus the online filter will inevitably suffer from an overflow problem. Fortunately, the ballot filter largely avoids this issue because it first updates the metadata of active vertices (b2), which, to some extent, averages out the active vertices across threads in step b3. Our evaluation also demonstrates such a difference.
Just-In-Time control adaptively exploits ballot and online
filters to retain the best performance. As shown in
Figure 7, SIMD-X always activates the online filter first.
Once a thread bin overflows, SIMD-X switches on ballot
filter to generate the correct task list for the next
iteration. Interestingly, we find that various algorithms
and graph datasets present different selection patterns,
which tie closely to the amount of workload: a higher
volume of workload often results in the activation
of ballot filter. As shown in Figure 8, BFS and SSSP
typically use the ballot filter in the middle of the
computation and online filter at the beginning and end. For
high-diameter graphs, BFS and SSSP avoid the use of ballot
filter. For instance, ER and RC always use the online
filter along 2,578, 555, 5,086 and 675 iterations. k-Core
activates the ballot filter at the initial iterations, i.e.,
typically the first two iterations, except on RC, which only
experiences one iteration because all its vertices have < 16
neighbors. At the extreme, BP and PageRank need the
ballot filter at exactly the first iteration of computation.
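
The selection logic can be sketched as follows. This is a minimal illustration, assuming that a vertex counts as active when its metadata changed during the iteration; the names `ballot_filter` and `jit_next_active` are hypothetical, and the metadata scan is a sequential stand-in for the GPU implementation.

```python
def ballot_filter(prev_meta, curr_meta):
    """Scan the metadata in vertex-ID order and keep the vertices that changed.

    The result is sorted and free of duplicates by construction.
    """
    return [v for v in range(len(curr_meta)) if curr_meta[v] != prev_meta[v]]


def jit_next_active(online_list, overflowed, prev_meta, curr_meta):
    """Pick the active list for the next iteration.

    The online filter result is used unless one of its thread bins
    overflowed, in which case the list is rebuilt by the ballot filter.
    Returns (name_of_filter_used, next_active_list).
    """
    if overflowed:
        return "ballot", ballot_filter(prev_meta, curr_meta)
    return "online", online_list
```

For example, with `prev_meta = [0, 5, 5, 9]` and `curr_meta = [0, 3, 5, 7]`, an overflowed online list is replaced by `[1, 3]`, while a non-overflowed online list such as `[3, 1, 3]` is passed through unchanged, duplicates included.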
Overflow thresholds for online filter. Clearly, this
parameter directly determines when to switch on ballot
filter, and thereby affects the overall performance.
Figure 9(a) presents the normalized performance with respect
to various thresholds. As expected, a threshold that is too
low or too high limits the performance, because SIMD-X is
then forced to switch to ballot filter either too early or
too late. As such, in this work we select 64 as the
predefined overflow threshold for all algorithms.
Overhead of online filter. After switching to ballot
filter, JIT task management also executes the online filter
in case it needs to switch back. Figure 9(b) studies
the overhead of this design. On average, there is 0.02%
slowdown, with the maximum of 2.1% observed for the
OR graph. The overhead is small because online filter
only tracks up to 64 (the predefined threshold) active
vertices for the next iteration, and this operation is
not on the critical path of the execution.
Classification of small, medium and large worklists.
Given GPU thread granularity, we initialize the separators
between the small, medium and large worklists to the warp
and block sizes (i.e., 32 and 128), respectively. Our further
investigation shows that the performance stays stable when
the separator between small and medium worklists lies in
the range of [4, 128], and when the separator between medium
and large worklists lies in [128, 2048]. We find the
performance starts to drop beyond these ranges.
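
A minimal sketch of this classification is shown below. It assumes that tasks are classified by vertex degree and that each separator is an exclusive upper bound; both choices, and the name `classify_worklists`, are assumptions made for illustration.

```python
def classify_worklists(active, degree, small_medium_sep=32, medium_large_sep=128):
    """Split active vertices into small, medium and large worklists.

    degree: vertex -> number of neighbors to process (assumed criterion).
    The default separators follow the warp size (32) and block size (128).
    """
    small, medium, large = [], [], []
    for v in active:
        d = degree[v]
        if d < small_medium_sep:
            small.append(v)      # few neighbors: one thread is enough
        elif d < medium_large_sep:
            medium.append(v)     # moderate: a warp cooperates
        else:
            large.append(v)      # many neighbors: a whole CTA cooperates
    return small, medium, large
```

For instance, vertices with degrees 3, 32, 127, 128 and 5000 land in the small, medium, medium, large and large worklists, respectively.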
5 Push-Pull Based Kernel Fusion
Kernel fusion [60], a common optimization for a
collection of iterative GPU applications, such as graph
computing and deep learning [2, 46, 21, 9, 7], reduces the
expensive overhead of kernel invocation, and minimizes
the global memory traffic, as the lifetime of registers
and shared memory is limited to each kernel. However,
traditional efforts, such as Gunrock [63] and Xiao
et al. [65], fail to fuse kernels across the global barrier.
This section starts with our observation and analysis of
the potential deadlock in the mainstream global barrier
design [65, 67] and subsequently introduces a lightweight
deadlock-free solution which enables global thread
synchronization within the fused kernel. However,
aggressive kernel fusion requires a large number of
registers and thus supports fewer parallel warps,
which could hurt the overall performance. To this end,
we introduce a push-pull based kernel fusion strategy to
minimize both the number of kernel invocations and the
register consumption.
Software global barrier is needed to enable the balanced
kernel fusion. Generally speaking, this approach
uses an array, lock, to synchronize all GPU threads
upon arrival and departure. During the processing, it
treats one thread CTA as the monitor and the remaining
CTAs as workers. At arrival, each worker CTA updates
its own status in lock. Once all worker CTAs have
arrived, the monitor changes the statuses of all CTAs to
departure, allowing all threads to proceed forward.
This approach, unfortunately, suffers from potential
deadlock [65], as illustrated in Figure 10. Specifically,
the worker thread CTAs may hold all GPU hardware
resources, such as streaming processors, registers and
shared memory, while waiting for the monitor to update
the lock array. In the meantime, the monitor cannot update
the lock array, due to lack of hardware resources
(e.g., thread over subscription).

Figure 10: Deadlock problem in software global barrier, where
‘C’, ‘$’, and ‘R’ represent CUDA core, L1 cache and register,
respectively.

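
The deadlock condition can be reproduced with a toy model. This sketch assumes that resident CTAs keep their hardware resources while spinning at the barrier and that CTAs become resident in launch order; the function `simulate_barrier` and its parameters are hypothetical and model only the scheduling argument, not real GPU behavior.

```python
def simulate_barrier(num_ctas, resident_slots, max_rounds=100):
    """Model a software global barrier on a GPU with limited resources.

    num_ctas:       CTAs launched by the kernel
    resident_slots: CTAs the hardware can hold at the same time
    Returns (completed, stuck_ctas).
    """
    arrived = [False] * num_ctas                        # the lock array
    resident = list(range(min(num_ctas, resident_slots)))
    pending = list(range(len(resident), num_ctas))      # waiting for resources
    for _ in range(max_rounds):
        for cta in resident:
            arrived[cta] = True                         # arrive, then spin
        if all(arrived):
            return True, []                             # barrier can be released
        # Spinning CTAs release nothing, so pending CTAs can never start.
    return False, pending
```

In this model, `simulate_barrier(60, 60)` completes, whereas `simulate_barrier(75, 60)` never does: the last 15 CTAs cannot obtain resources, so the lock array is never fully updated.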
Compiler-based deadlock free barrier. SIMD-X
configures the barrier so that every CTA,
whether a worker or the monitor, can obtain hardware
resources when needed. This is achieved by
comparing the resources needed by the kernels against
the total available resources. Based on the GPU
architecture, we can obtain the total number of registers
(#registersPerSMX) that each streaming multiprocessor
provides, e.g., 65,536 registers on NVIDIA
K40 GPUs and 32,768 on K20 GPUs. On the
other hand, we can collect the register consumption
(#registersPerThread) of each kernel at the compilation
stage. Putting these numbers together, SIMD-X is able
to calculate the appropriate thread configuration for the
kernels.
The number of CTAs can be computed as follows:
\[
\#CTA = \left\lfloor \frac{\#registersPerSMX}{\#registersPerThread \cdot \#threadsPerCTA} \right\rfloor \cdot \#SMX \tag{1}
\]
where #threadsPerCTA is configured by a user, i.e.,
128 by default. For example, consider deploying a kernel
in which each thread consumes 110 registers on a K40,
which contains 15 SMXs with 65,536 registers each.
If #threadsPerCTA is set to 128, one gets
#CTA = floor(65536 / (110 × 128)) × 15 = 4 × 15 = 60.
As a result, we configure this kernel with a CTA count
of 60 and a thread count per CTA of 128.
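
Equation (1) translates directly into a few lines of code. The sketch below only restates the formula; the function name `deadlock_free_cta_count` is hypothetical, and other hardware limits (for example shared memory or the maximum number of resident threads) are deliberately ignored.

```python
def deadlock_free_cta_count(registers_per_smx, registers_per_thread,
                            threads_per_cta, num_smx):
    """Number of CTAs that can all be resident at once, following Equation (1)."""
    registers_per_cta = registers_per_thread * threads_per_cta
    ctas_per_smx = registers_per_smx // registers_per_cta   # floor division
    return ctas_per_smx * num_smx
```

For the K40 example above, `deadlock_free_cta_count(65536, 110, 128, 15)` gives 60. Rounding up instead of down would give 75 CTAs, which need more registers than the 15 SMXs provide.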
Notably, portable Inter-Block Barrier [56] is closely
related to our effort. However, this method proposes an
extremely complicated thread block management
mechanism that must distinguish, during runtime, whether
a thread block will execute useful workloads or not.
This requires nontrivial programmer effort
and scheduling overhead. In comparison, our method
achieves this deadlock-free configuration before runtime
and is completely transparent to the end users.
Push-Pull based kernel fusion. As shown in Table 2,
the register consumption (reported with the compilation
flag -Xptxas -v) increases from 25 on average before kernel
fusion to 110 after it, that is, 4.4×. It becomes clear
that we need a balanced fusion strategy that keeps both
register consumption and kernel invocation low. To this end,
SIMD-X leverages the push-pull model used in the graph
algorithms. That is, such algorithms often use push or
pull based computing in several consecutive iterations.
For example, BFS and SSSP utilize push in the first and
last iterations, and pull in between. In contrast, k-Core
conducts pull at the beginning and push at the end.

Table 2: Register consumption for various kernels.
Kernel                        Register consumption
Push (no fusion), Thread      26
Push (no fusion), Warp        27
Push (no fusion), CTA         28
Push (no fusion), Task mgt    24
Pull (no fusion), Thread      24
Pull (no fusion), Warp        24
Pull (no fusion), CTA         22
Pull (no fusion), Task mgt    30
Selective fusion, push        48
Selective fusion, pull        50
All fusion                    110
Kernel launching count: up to 40,688 (no fusion), 3 (selective fusion), 1 (all fusion).

Figure 11: Consecutive iterations from graph algorithms often
cluster to push and pull model computation separately: (a) all
fusion, (b) selective fusion.

The idea of push-pull based kernel fusion is to fuse
kernels around the pull and push computing. In other
words, for the push-based iterations, SIMD-X fuses dif-
ferent compute kernels (for thread, warp, CTA), as well
as task management kernel, into one push kernel. The
kernel only terminates when the computation finishes or
it needs to switch to pull computing according to the cri-
terion discussed in Section 3.3. Similar optimizations are
done for the pull-based iterations.
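
The effect on the number of kernel launches can be sketched by counting launches for a given sequence of push and pull iterations. The figure of four kernels per non-fused iteration (thread, warp and CTA compute kernels plus task management) is an assumption read off Table 2, and the function names are hypothetical.

```python
from itertools import groupby

# Assumed: thread, warp and CTA compute kernels plus one task management kernel.
KERNELS_PER_ITERATION = 4


def launches_no_fusion(modes):
    """Every iteration launches each kernel separately."""
    return KERNELS_PER_ITERATION * len(modes)


def launches_all_fusion(modes):
    """One kernel covers all iterations of both models."""
    return 1 if modes else 0


def launches_push_pull_fusion(modes):
    """One launch per run of consecutive iterations that use the same model."""
    return sum(1 for _ in groupby(modes))
```

For a BFS-like pattern such as `["push"] * 2 + ["pull"] * 3 + ["push"] * 4`, this gives 36 launches without fusion, 1 with all fusion, and 3 with push-pull based fusion.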
Using the new push-pull based fusion, the register
consumption decreases to 48 and 55, and thus increases the
configurable thread count by 50%. Together, our evaluations
demonstrate a 25% performance improvement in
Figure 13 of Section 7.2.
Table 2 presents the register consumption and kernel
invocation of different kernel fusion techniques. By using
the push-pull based kernel fusion, the number of kernel
launches is reduced to 3 while the register consumption is
cut by half. In Figure 13, we will later show that this
technique brings up to 80% performance benefit.
6 Graph Algorithms and Datasets
In addition to SSSP that is discussed in Section 3.3,
this section further presents a variety of algorithms which
are implemented on SIMD-X to examine the expressive-
ness of ACC programming model, and performance im-
pacts of task management and kernel fusion techniques.
BFS [29] traverses a graph level by level. At each level, it
loads all neighbors that are connected to vertices visited
in the preceding level, inspects their statuses (metadata),
and subsequently marks those unvisited neighbors as active
for the next iteration. Notably, BFS synchronizes
at the end of each level and relies on vote to
combine the updates. During the entire traversal,
BFS typically experiences a light workload at the
beginning and end of the computation and a heavy
workload in the middle.
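
A sequential sketch of this level-by-level traversal over a CSR graph is given below. It is not the ACC implementation: the function name `bfs_levels` is hypothetical, and the per-level synchronization is implicit in the loop structure.

```python
def bfs_levels(row_ptr, col_idx, source):
    """Level-synchronous BFS on a CSR graph.

    row_ptr, col_idx: CSR arrays (neighbors of v are
                      col_idx[row_ptr[v]:row_ptr[v + 1]])
    Returns the BFS level of every vertex (-1 if unreachable).
    """
    num_vertices = len(row_ptr) - 1
    level = [-1] * num_vertices          # metadata: -1 means unvisited
    level[source] = 0
    active = [source]
    depth = 0
    while active:
        depth += 1
        next_active = []
        for u in active:
            for e in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[e]
                if level[v] == -1:       # inspect the neighbor's status
                    level[v] = depth     # mark it visited at this level
                    next_active.append(v)
        active = next_active             # end of level: implicit barrier
    return level
```

On an undirected graph with edges 0-1, 0-2, 1-3, 2-3 and 3-4, stored as `row_ptr = [0, 2, 4, 6, 9, 10]` and `col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]`, a traversal from vertex 0 returns the levels `[0, 1, 1, 2, 3]`.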
Belief propagation (BP), also known as the sum-product
message passing algorithm, infers the posterior probability
of each event based on the likelihoods and prior
probabilities of all related events. Once modeled as a graph
(Bayesian network or Markov random field), each event
becomes a vertex, with all incoming vertices and edges
as related events and corresponding likelihoods. In BP,
the vertex probability is the metadata.
k-Core (KC), which is widely used in graph visualization
applications [30, 41], iteratively deletes the vertices
whose degree is less than k until all remaining vertices in
the graph possess at least k neighbors. k-Core experiences
a large volume of workload at the initial iterations,
followed by light workloads. This work uses a default
value of k = 16.
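
The iterative deletion can be sketched as follows, assuming an undirected graph given as adjacency lists. The function name `k_core` is hypothetical; the sketch skips degree updates for vertices that have already been deleted, which is one way to avoid unnecessary work.

```python
def k_core(adj, k):
    """Return the set of vertices that survive k-Core peeling.

    adj: vertex -> list of neighbors (undirected graph)
    """
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    active = [v for v, d in degree.items() if d < k]   # first batch to delete
    removed = set(active)
    while active:
        next_active = []
        for v in active:
            for u in adj[v]:
                if u in removed:
                    continue             # already deleted: no update needed
                degree[u] -= 1
                if degree[u] < k:        # u falls below k and is deleted next
                    removed.add(u)
                    next_active.append(u)
        active = next_active
    return set(adj) - removed
```

For a triangle 0-1-2 with a tail 2-3-4, `k_core(adj, 2)` peels vertices 4 and 3 and keeps `{0, 1, 2}`, while `k_core(adj, 3)` deletes every vertex.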
PageRank (PR) [45] iteratively updates the rank value of
each vertex based on the contributions of all its
in-neighbors until all vertices have stable rank values.
Because the contributions of in-neighbors are summed into
the destination vertex, we start PageRank with the pull model
and agg sum as the merge operation. At the end of
PageRank, we switch to the push model because the
majority of the vertices are stable [72]. The switch is
decided by a decision tree.
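
The pull-based update can be illustrated with the sketch below. It assumes a damping factor of 0.85, a maximum-change convergence test, and a graph without dangling vertices; none of these details come from the text, and the function name `pagerank_pull` is hypothetical. The switch to the push model is not modeled.

```python
def pagerank_pull(in_neighbors, out_degree, damping=0.85, tol=1e-6, max_iters=100):
    """Pull-based PageRank: every vertex sums contributions of its in-neighbors.

    in_neighbors: list where in_neighbors[v] holds the in-neighbors of v
    out_degree:   list of out-degrees (assumed to be non-zero)
    """
    n = len(in_neighbors)
    rank = [1.0 / n] * n
    for _ in range(max_iters):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[u] / out_degree[u] for u in in_neighbors[v])
            for v in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:                  # all rank values are stable
            break
    return rank
```

On a directed 3-cycle (`in_neighbors = [[2], [0], [1]]`, all out-degrees 1), every vertex keeps a rank of one third, as expected from symmetry.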
Storage Format. SIMD-X employs the compressed sparse
row (CSR) format to store the graph. For an undirected
graph, we only need to store the out-neighbors of each
vertex. For a directed graph, we store both the
out-neighbors and in-neighbors of each vertex to support
the push and pull based processing.
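
As an illustration of this layout, the sketch below builds the CSR arrays from an edge list and, for a directed graph, a second CSR for the in-neighbors. The function names `build_csr` and `build_directed` are hypothetical; this is a generic CSR construction, not the SIMD-X loader.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) edges."""
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1            # count the neighbors of each vertex
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]     # prefix sum gives the row offsets
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1]                # next free slot per vertex (a copy)
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def build_directed(num_vertices, edges):
    """Out-neighbor CSR (for push) and in-neighbor CSR (for pull)."""
    out_csr = build_csr(num_vertices, edges)
    in_csr = build_csr(num_vertices, [(dst, src) for src, dst in edges])
    return out_csr, in_csr
```

For the edges `(0, 1)`, `(0, 2)` and `(2, 1)` on three vertices, the out-neighbor CSR is `([0, 2, 2, 3], [1, 2, 1])` and the in-neighbor CSR is `([0, 0, 2, 3], [0, 2, 0])`.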
Graph Benchmarks. We evaluate on a wide range
of graphs as shown in Table 3, which fall into four
types, i.e., social networks, road maps, hyperlink web
and synthetic graphs. Particularly, Facebook [13],
LiveJournal [55], Orkut [55], Pokec [55], and Twitter [27]
are common social networks. Europe-osm [11] and
RoadCA-net [57] are two large roadmap graphs, and
UK-2002 [57] is a web graph. Furthermore, we use the Graph500
generator to generate Kron24 [5], and GTgraph [15] for the
R-MAT and random graphs. Europe-osm and RoadCA-net
are high diameter graphs, with 2570 and 555 as their
diameters, respectively. LiveJournal, Pokec, Twitter and
UK-2002 are medium diameter graphs, i.e., 10 - 30 as
their diameters. The diameters of the remaining graphs
are all smaller than 10. For graphs without edge weight,
we use a random generator to generate one weight for
each edge, similar to Gunrock [63].

Table 3: Graph Dataset
Graph Name    Abbrev.   Vertex Count   Edge Count
Facebook      FB        16,777,215     775,824,943
Europe-osm    ER        50,912,018     108,109,319
Kron24        KR        16,777,216     536,870,911
LiveJournal   LJ        4,847,571      136,950,781
Orkut         OR        3,072,626      234,370,165
Pokec         PK        1,632,803      61,245,127
Random        RD        4,000,000      511,999,999
RoadCA-net    RC        1,971,281      5,533,213
R-MAT         RM        3,999,983      511,999,999
UK-2002       UK        18,520,343     596,227,523
Twitter       TW        25,165,811     787,169,139

7 Experiments
We implement SIMD-X¹ with 5,660 lines of CUDA
and C++ code. All the algorithms presented in Section 6
are implemented with around 100 lines of C++ code. The
source code is compiled by GCC 4.8.5 and NVIDIA nvcc
7.5 with the optimization flag -O3. In this work, we
evaluate SIMD-X on a Linux workstation with two Intel
Xeon E5-2683 CPUs (14 physical cores with 28 hyperthreads)
and 512GB main memory. Throughout the
evaluation, we use uint32 as the vertex ID and uint64 as
the index, and evaluate our system on NVIDIA K40 GPUs
unless otherwise specified. We also test SIMD-X on the
earlier K20 and latest P100 GPUs. The timing starts
once the graph data is loaded in GPU global memory.
Each result is reported as the average of 64 runs.
7.1 Comparison with State-of-the-art
Table 4 summarizes the runtime of SIMD-X against
Galois and Gunrock, which are state-of-the-art CPU and
GPU graph processing systems, respectively, as well
as CuSha (GPU) and Ligra (CPU), two popular graph
frameworks. The takeaways from this table are twofold.
First, SIMD-X is both space efficient and robust. As
one can see, since CuSha requires an edge list as the input
for computation, it cannot accommodate large graphs
(e.g., FB and TW) across all algorithms. Besides, since
Gunrock requires a large amount of space for batch filter,
it suffers out of memory (OOM) errors for all larger graphs
in SSSP. Even though the CPU systems (Galois and Ligra)
enjoy ample memory space (512 GB) from the CPU, they
cannot converge to a result for high diameter graphs. That
is, Galois cannot converge for SSSP on ER while Ligra
fails to obtain a result for BFS on the UK graph.

¹ SIMD-X will be released in open source upon the paper publication.

Second, SIMD-X outperforms all graph processing
frameworks. In general, SIMD-X is 24×, 2.9×, 6.5× and
3.3× faster than CuSha, Gunrock, Galois and Ligra,
respectively. In BFS, SIMD-X bests CuSha, Gunrock, Galois
and Ligra by 9.6×, 4.8×, 2.1× and 2.4×, respectively.
We also notice that SIMD-X is slower than Galois
on the RD graph because workload balancing brings
negligible benefits to a uniform-degree graph (RD). Also,
SIMD-X is slightly worse than Ligra on the RM graph since
this graph only has a diameter of four, thus the
optimizations of JIT task management and kernel fusion
bring trivial benefits to GPU based graph systems, as
evident by the much lower performance of CuSha and
Gunrock.
In PageRank, SIMD-X achieves 1.2×, 2.1×, 2.3× and
4× speedups over CuSha, Gunrock, Galois and Ligra,
respectively. Note that, although CuSha cannot support all
large graphs due to its large memory space consumption, it
performs roughly on par with SIMD-X and even outperforms
SIMD-X on LJ and OR. This is generally because
PageRank tends to be computation intensive and needs
to compute all edges, curbing the benefits of task
management and kernel fusion. However, the edge list format
(of CuSha) doubles the memory consumption, leading to
OOM for all large graphs.
For SSSP, SIMD-X is 21× faster, on average, than all four
systems. We expect SIMD-X to outperform all systems
by a larger margin than observed for BFS because SSSP
experiences more iterations with a larger volume of active
tasks, which favors the ACC model, JIT task
management and push-pull based kernel fusion. However,
because Gunrock fails to accommodate all large
graphs, our benefits cannot surface there, ending with
merely a 1.8× speedup. In addition, CuSha spends 519,674 ms
on the high diameter ER graph, which is 480× slower than
SIMD-X, because task management is absent from CuSha.
We also notice Galois performs better than SIMD-X on
RD, again, due to the uniform degree distribution.
For k-Core, where k = 32, SIMD-X outperforms Ligra by 20×.
Such a striking advantage comes from three parts. First,
as reflected by Figure 12(b), k-Core generates an extensive
amount of workload variation and thus benefits tremendously
from JIT task management. Second, k-Core’s
iterative nature also enjoys the benefits from push-pull
based kernel fusion, as shown in Figure 13(c). Lastly,
the flexibility of ACC allows innovative k-Core algorithm
designs: we stop further subtracting the degree of a
destination vertex once its degree goes below k, which
avoids a tremendous number of unnecessary updates. Note
that comparisons on Belief Propagation, as well as with
other systems for k-Core, are not included because
those systems fail to support such algorithms.

Table 4: Runtime (ms) of SIMD-X, Gunrock, and Galois. A K40 GPU is used to test SIMD-X and Gunrock, and a CPU with 28
threads for Galois. The blank space indicates the test cannot complete for the given algorithm and graph; rows with fewer
than eleven runtimes list only the completed tests, in graph order.
Alg      System    Runtime on FB ER KR LJ OR PK RD RC RM UK TW                Avg. speedup
BFS      SIMD-X    198 400 130 59 40 20 82 15 47 308 210                      -
BFS      CuSha     988 224 341 72 435 297 674 4298                            9.6
BFS      Gunrock   685 849 677 71 225 44 647 146 506 312 697                  4.8
BFS      Galois    482 1068 140 139 42 34 48 53 65 229 322                    2.1
BFS      Ligra     1086 1426 176 89 51 31 88 48 40 496                        2.4
PR       SIMD-X    1553 346 1141 236 435 118 1105 13 800 637 1525             -
PR       CuSha     1704 182 323 180 1402 15 886                               1.2
PR       Gunrock   3004 884 3129 275 927 166 2963 43 2208 784 3180            2.1
PR       Galois    4552 603 3069 424 1061 218 3576 20 2067 842 4178           2.3
PR       Ligra     16780 1368 2000 1324 1786 310 809 35 1703 9360             4
SSSP     SIMD-X    1816 1080 998 284 604 143 1505 223 478 703 1344            -
SSSP     CuSha     519674 1663 692 1120 260 1610 438 1236                     62
SSSP     Gunrock   1206 1220 431 1259 336 5059 229                            1.8
SSSP     Galois    161596 8485 1785 1166 356 747 3440 5877 9081 1818          15
SSSP     Ligra     14067 3043 2893 1627 1567 605 3353 301 2783 1300 5217      3.7
k-Core   SIMD-X    366 78 131 60 63 33 10 4 19 151 277                        -
k-Core   Ligra     6337 1167 2813 1707 1700 654 27 36 235 6627 5783           20

7.2 Benefits of Various Techniques
This section studies the performance impacts brought
by JIT task management and push-pull based kernel fusion.
As we have presented in Section 4, JIT task
management only works for applications that experience
workload variations, that is, BFS, k-Core and SSSP. On
the other hand, push-pull based kernel fusion is applicable
to all five algorithms.
On average, JIT task management, presented in Figure 12,
is 16×, 26× and 4.5× faster than the ballot filter
for BFS, k-Core and SSSP, respectively. As expected, online
filter alone cannot work for many graphs, particularly large
ones, e.g., the Facebook, Twitter and UK-2002 graphs in BFS
and SSSP. On graphs without the overflow problem (ER
and RC), JIT task management adds a small 1-2%
overhead on top of the online filter on BFS and SSSP.
On k-Core, JIT task management is, on average,
28.5× and 5% faster than ballot and online filter, re-
spectively. We also observe that the ballot filter outper-
forms the online filter on ER and RC graphs by 3.4× and
1.7×, because k-Core removes a large volume of vertices
which favors the former to produce a non-redundant and
sorted work list.
Push-pull based kernel fusion brings, on average, 43%
and 25% improvement over non-fusion and all-fusion
across all algorithms and graphs. In particular, push-pull
based kernel fusion tops non-fusion by 74%, 11%, 85%,
10% and 66% on BFS, BP, k-Core, PageRank and SSSP,
respectively. BFS, k-Core and SSSP achieve more performance
gains because they are not computation intensive and tend
to run a higher number of iterations. Compared with all
fusion, our new kernel fusion is 55%, 6%, 62%, 25% and 11%
faster on BFS, BP, k-Core, PageRank and SSSP. It is
important to note that all fusion is not always beneficial,
i.e., the all-fusion option of PageRank is on average 13%
slower than no fusion because all fusion limits the number
of configurable threads. However, for memory intensive
applications, like BFS and SSSP on ER and RC, all fusion
is on average 2× better.
7.3 Performance on Different GPUs
We also evaluate SIMD-X, Gunrock and CuSha on various
GPU models, such as K20 and P100 GPUs. It is not
surprising to see that SIMD-X presents the biggest
performance gain on the latest GPUs. In detail, SIMD-X on
K40 and P100 performs 1.7× and 5.1× better than on the K20
GPU. In contrast, Gunrock merely gets 1.1× and 1.7×
performance improvement when moving from K20 to K40
and P100, respectively. Similarly for CuSha, its performance
on K40 and P100 is 1.2× and 3.5× better than on
K20, respectively. The root cause of this disparity is that
SIMD-X’s kernel fusion technique can dynamically configure
its GPU kernels to the fitting thread count on the
corresponding hardware so as to achieve the peak
performance. For instance, for BFS the thread count increases
by 1.2× and 5.1× on K40 and P100, relative to K20.
8 Related Work
Recent advances in graph computing fall into algorithm
innovation [39, 72], framework developments [37, 14,
33, 28, 30, 75, 77, 18, 53, 51, 49, 19, 42, 48, 61, 6, 66, 68,
52, 73, 62, 43, 74, 70, 69, 3, 64, 40, 17, 8] and
accelerator optimizations [63, 29, 38, 25, 71, 47]. This section
covers relevant work from three aspects: programming
model, task management and kernel fusion.
Besides edge and vertex centric models, there are also
other models that make various trade-offs between
simplicity and performance. For instance, “think like a
graph” [58] requires each vertex to obtain the view of
the entire partition on one machine in order to minimize
the communication cost. Furthermore, domain specific
programming language systems, such as Galois [42],
Green-Marl [19] and Trinity [51], allow programmers to
write single-threaded source code while enjoying
multi-threaded processing. In comparison, SIMD-X decouples
the goal of programming simplicity and performance:
with ACC, SIMD-X ultimately designs a data-parallel
abstraction for deploying irregular graph applications on
GPU. With JIT task management and push-pull based
kernel fusion, SIMD-X pushes the performance towards
a magnitude faster than state-of-the-art CPU and GPU
frameworks.

Figure 12: Benefit of just-in-time task management, normalized to the performance of the ballot filter.
Panels: (a) BFS, (b) k-Core, (c) SSSP. Each panel plots the speedup of the Ballot, Online and JIT
filters on the FB, ER, KR, LJ, OR, PK, RD, RC, RM, UK and TW graphs.

Figure 13: Benefit of push-pull based kernel fusion, normalized to the performance of no fusion.
Panels: (a) BFS, (b) BP, (c) k-Core, (d) PageRank, (e) SSSP. Each panel plots the speedup of
Non-fusion, All-fusion and Push-pull fusion on the FB, ER, KR, LJ, OR, PK, RD, RC, RM, UK and TW graphs.

Task management is an important optimization for
GPU-based graph computing. Besides batch filter [63,
38], there also exist other task management approaches:
strided filter [29, 31] and atomic filter [34]. Particularly,
strided filter resembles ballot filter, but the former
experiences strided memory access when scanning
the metadata and thus performs up to 16× worse than ballot
filter. Atomic filter is similar to online filter, but
it relies on atomic operations to put active vertices into
a global active list, which is orders of magnitude
slower than online filter. Beyond showing that ballot and
online filters best batch, strided and atomic filters,
SIMD-X goes further by introducing a JIT controller that
adaptively uses online filter and ballot filter to further
improve the performance. We also find that JIT task
management can be exploited to help manage active lists for
other applications such as warp segmentation [24] and
CSR5 [32].
Kernel fusion affects applications far beyond graph
computations. SIMD-X demonstrates its benefits in
graph computing and Belief Propagation (BP) applications.
SIMD-X is closely related to global software barrier
[65, 67]. However, previous work fails to identify the
deadlock issue in this global software barrier,
and thus offers no solution to this issue. In contrast,
SIMD-X unveils, systematically analyzes, and resolves this
problem. To avoid high register consumption, SIMD-X further
selectively fuses kernels by exploiting the special kernel
launching patterns of graph algorithms. It is also
important to mention existing work [60] that only fuses
kernels up to the barrier boundary. In comparison, SIMD-X
fuses kernels across barriers. Our design can also benefit
the popular Persistent Kernel [16] designs, which have been
found to suffer from deadlock issues when the occupancy
exceeds an unknown bound [36, 20].
9 Conclusion
In this work, we propose SIMD-X, a parallel graph
computing framework that supports programming and
processing of single instruction multiple, complex, data
on GPUs. Specifically, the Active-Compute-Combine
(ACC) model provides ease of programming to program-
mers, while just-in-time task management and push-
pull based kernel fusion leverage the opportunities for
system-level optimization. Using SIMD-X, a user can
program a graph algorithm in tens of lines of code, while
achieving significant speedup over the state-of-the-art.
Acknowledgment
Hang Liu did part of this work as a Graduate Research
Assistant at the George Washington University. This
work was partially supported by National Science Foun-
dation CAREER award 1350766 and grants 1618706
and 1717774 at George Washington University. This re-
search used resources from XSEDE and Amazon AWS
research credits at University of Massachusetts Lowell.
References
[1] Nvidia cuda c programming guide. NVIDIA Cor-
poration, 2011.
[2] Martín Abadi, Paul Barham, Jianmin Chen,
Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu
Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, et al. TensorFlow: A System
for Large-Scale Machine Learning. In OSDI,
volume 16, pages 265–283, 2016.
[3] Zhiyuan Ai, Mingxing Zhang, Yongwei Wu, Xue-
hai Qian, Kang Chen, and Weimin Zheng. Squeez-
ing out all the value of loaded data: An out-of-
core graph processing system with reduced disk
i/o. In 2017 USENIX Annual Technical Conference
(USENIX ATC 17),(Santa Clara, CA), pages 125–
137, 2017.
[4] S Beamer, K Asanovic, and D Patterson. Direction-
optimizing Breadth-First Search. In International
Conference for High Performance Computing, Net-
working, Storage and Analysis (SC), pages 1–10.
IEEE, 2012.
[5] Deepayan Chakrabarti, Yiping Zhan, and Christos
Faloutsos. R-MAT: A Recursive Model for Graph
Mining. In SDM, 2004.
[6] Rong Chen, Xin Ding, Peng Wang, Haibo Chen,
Binyu Zang, and Haibing Guan. Computa-
tion and communication efficient graph process-
ing with distributed immutable view. In Proceed-
ings of the 23rd international symposium on High-
performance parallel and distributed computing,
pages 215–226. ACM, 2014.
[7] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan
Wang, Minjie Wang, Tianjun Xiao, Bing Xu,
Chiyuan Zhang, and Zheng Zhang. Mxnet: A
flexible and efficient machine learning library for
heterogeneous distributed systems. arXiv preprint
arXiv:1512.01274, 2015.
[8] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan
Miao, Xuetian Weng, Ming Wu, Fan Yang, Li-
dong Zhou, Feng Zhao, and Enhong Chen. Ki-
neograph: taking the pulse of a fast-changing and
connected world. In Proceedings of the 7th ACM
european conference on Computer Systems, pages
85–98. ACM, 2012.
[9] Sharan Chetlur, Cliff Woolley, Philippe Vandermer-
sch, Jonathan Cohen, John Tran, Bryan Catanzaro,
and Evan Shelhamer. cudnn: Efficient primitives
for deep learning. arXiv preprint arXiv:1410.0759,
2014.
[10] Andrew Davidson, Sean Baxter, Michael Garland,
and John D Owens. Work-efficient parallel GPU
methods for single-source shortest paths. In 28th
International Symposium on Parallel & Distributed
Processing (IPDPS), pages 349–359. IEEE, 2014.
[11] European Open Stream Map.
http://download.geofabrik.de/europe-latest.osm.bz2.
[12] Benedict R Gaster and Lee Howes. Can GPGPU
Programming Be Liberated from the Data-Parallel
Bottleneck? Computer, 2012.
[13] Minas Gjoka, Maciej Kurant, Carter T Butts, and
Athina Markopoulou. Practical Recommendations
on Crawling Online Social Networks. IEEE Jour-
nal on Selected Areas in Communications, 2011.
[14] Joseph E Gonzalez, Yucheng Low, Haijie Gu,
Danny Bickson, and Carlos Guestrin. PowerGraph:
Distributed Graph-Parallel Computation on Natural
Graphs. In OSDI, volume 12, page 2, 2012.
[15] GTgraph: A suite of synthetic random graph generators.
http://www.cse.psu.edu/~madduri/software/GTgraph/.
[16] Kshitij Gupta, Jeff A Stuart, and John D Owens.
A study of persistent threads style GPU program-
ming for GPGPU workloads. In Innovative Parallel
Computing (InPar), 2012, pages 1–14. IEEE, 2012.
[17] Wentao Han, Youshan Miao, Kaiwei Li, Ming
Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran,
Wenguang Chen, and Enhong Chen. Chronos:
a graph engine for temporal graph analysis. In
Proceedings of the Ninth European Conference on
Computer Systems, page 1. ACM, 2014.
[18] Wook-Shin Han, Sangyeon Lee, Kyungyeol Park,
Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, and
Hwanjo Yu. TurboGraph: a fast parallel graph en-
gine handling billion-scale graphs in a single PC. In
Proceedings of international conference on Knowl-
edge discovery and data mining (SIGKDD), pages
77–85, 2013.
[19] Sungpack Hong, Hassan Chafi, Eric Sedlar, and
Kunle Olukotun. Green-Marl: a DSL for easy and
efficient graph analysis. In Proceedings of the sev-
enteenth international conference on Architectural
Support for Programming Languages and Operat-
ing Systems (ASPLOS), volume 40, pages 349–362,
2012.
[20] Derek R Hower, Blake A Hechtman, Bradford M
Beckmann, Benedict R Gaster, Mark D Hill,
Steven K Reinhardt, and David A Wood.
Heterogeneous-race-free memory models.
ACM SIGARCH Computer Architecture News,
42(1):427–440, 2014.
[21] Yangqing Jia, Evan Shelhamer, Jeff Donahue,
Sergey Karayev, Jonathan Long, Ross Girshick,
Sergio Guadarrama, and Trevor Darrell. Caffe:
Convolutional architecture for fast feature embed-
ding. In Proceedings of the 22nd ACM interna-
tional conference on Multimedia, pages 675–678.
ACM, 2014.
[22] Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat
McCormick, Mattan Erez, and Alex Aiken. A Dis-
tributed Multi-GPU System for Fast Graph Pro-
cessing. Proceedings of the VLDB Endowment,
11(3):297–310, 2017.
[23] Farzad Khorasani. High Performance Vertex-
Centric Graph Analytics on GPUs. PhD Disser-
tation: University of California, Riverside, 2016.
[24] Farzad Khorasani, Rajiv Gupta, and Laxmi N
Bhuyan. Scalable simd-efficient graph processing
on gpus. In Parallel Architecture and Compilation
(PACT), 2015 International Conference on, pages
39–50. IEEE, 2015.
[25] Farzad Khorasani, Keval Vora, Rajiv Gupta, and
Laxmi N Bhuyan. CuSha: vertex-centric graph pro-
cessing on GPUs. In Proceedings of the 23rd inter-
national symposium on High-performance parallel
and distributed computing, pages 239–252. ACM,
2014.
[26] Min-Soo Kim, Kyuhyeon An, Himchan Park,
Hyunseok Seo, and Jinwook Kim. GTS: A fast and
scalable graph processing method based on stream-
ing topology to GPUs. In Proceedings of the 2016
International Conference on Management of Data,
pages 447–461. ACM, 2016.
[27] Haewoon Kwak, Changhyun Lee, Hosung Park,
and Sue Moon. What is Twitter, a social network
or a news media? In WWW, 2010.
[28] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin.
GraphChi: large-scale graph computation on just a
PC. In Proceedings of the 10th USENIX conference
on Operating Systems Design and Implementation,
pages 31–46. USENIX Association, 2012.
[29] Hang Liu and H. Howie Huang. Enterprise:
Breadth-First Graph Traversal on GPU Servers.
In International Conference for High Performance
Computing, Networking, Storage and Analysis
(SC), 2015.
[30] Hang Liu and H. Howie Huang. Graphene: Fine-
Grained IO Management for Graph Computing. In
Proceedings of the 15th USENIX Conference on
File and Storage Technologies. USENIX Associa-
tion, 2017.
[31] Hang Liu, H Howie Huang, and Yang Hu. iBFS:
Concurrent Breadth-First Search on GPUs. In Pro-
ceedings of the 2016 International Conference on
Management of Data (SIGMOD), 2016.
[32] Weifeng Liu and Brian Vinter. CSR5: An efficient
storage format for cross-platform sparse matrix-
vector multiplication. In Proceedings of the 29th
ACM on International Conference on Supercom-
puting, pages 339–350. ACM, 2015.
[33] Yucheng Low, Joseph Gonzalez, Aapo Kyrola,
Danny Bickson, Carlos Guestrin, and Joseph M
Hellerstein. GraphLab: A new framework for parallel
machine learning. 2010.
[34] Lijuan Luo, Martin Wong, and Wen-mei Hwu.
An effective GPU implementation of breadth-first
search. In Proceedings of the 47th design automa-
tion conference, pages 52–55. ACM, 2010.
[35] Steffen Maass, Changwoo Min, Sanidhya Kashyap,
Woonhak Kang, Mohan Kumar, and Taesoo Kim.
Mosaic: Processing a trillion-edge graph on a sin-
gle machine. In Proceedings of the Twelfth Euro-
pean Conference on Computer Systems, pages 527–
543. ACM, 2017.
[36] Sepideh Maleki, Annie Yang, and Martin
Burtscher. Higher-order and tuple-based
massively-parallel prefix sums, volume 51.
ACM, 2016.
[37] Grzegorz Malewicz, Matthew H Austern, Aart JC
Bik, James C Dehnert, Ilan Horn, Naty Leiser, and
Grzegorz Czajkowski. Pregel: a system for large-
scale graph processing. In Proceedings of the 2010
ACM SIGMOD International Conference on Man-
agement of data, pages 135–146. ACM, 2010.
[38] Duane Merrill, Michael Garland, and Andrew
Grimshaw. Scalable GPU graph traversal. In
PPoPP, 2012.
[39] Ulrich Meyer and Peter Sanders. ∆-Stepping: A
Parallel Single Source Shortest Path Algorithm. In
Algorithms - ESA '98, 1998.
[40] Youshan Miao, Wentao Han, Kaiwei Li, Ming Wu,
Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Enhong
Chen, and Wenguang Chen. ImmortalGraph:
A system for storage and analysis of temporal
graphs. ACM Transactions on Storage (TOS), 2015.
[41] Alberto Montresor, Francesco De Pellegrini, and
Daniele Miorandi. Distributed k-Core Decompo-
sition. IEEE Transactions on Parallel and Dis-
tributed Systems, 2013.
[42] Donald Nguyen, Andrew Lenharth, and Keshav
Pingali. A lightweight infrastructure for graph
analytics. In Proceedings of the Twenty-Fourth
ACM Symposium on Operating Systems Principles
(SOSP), pages 456–471. ACM, 2013.
[43] Amir Hossein Nodehi Sabet, Junqiao Qiu, and Zhi-
jia Zhao. Tigr: Transforming Irregular Graphs
for GPU-Friendly Graph Processing. In Proceed-
ings of the Twenty-Third International Conference
on Architectural Support for Programming Lan-
guages and Operating Systems, pages 622–636.
ACM, 2018.
[44] Nvidia. NVIDIA Kepler GK110 Architecture
Whitepaper. 2013.
[45] Lawrence Page, Sergey Brin, Rajeev Motwani, and
Terry Winograd. The PageRank Citation Ranking:
Bringing Order to the Web. Technical report, Stan-
ford InfoLab, 1999.
[46] Adam Paszke, Sam Gross, Soumith Chintala, Gre-
gory Chanan, Edward Yang, Zachary DeVito, Zem-
ing Lin, Alban Desmaison, Luca Antiga, and Adam
Lerer. Automatic differentiation in PyTorch. 2017.
[47] Vijayan Prabhakaran, Ming Wu, Xuetian Weng,
Frank McSherry, Lidong Zhou, and Maya Hari-
dasan. Managing large graphs on multi-cores with
graph awareness. In Proceedings of USENIX con-
ference on Annual Technical Conference. USENIX
Association, 2012.
[48] Amitabha Roy, Laurent Bindschaedler, Jasmina
Malicevic, and Willy Zwaenepoel. Chaos: Scale-
out Graph Processing from Secondary Storage. In
Proceedings of the 25th Symposium on Operating
Systems Principles, pages 410–424. ACM, 2015.
[49] Amitabha Roy, Ivo Mihailovic, and Willy
Zwaenepoel. X-Stream: Edge-centric graph processing
using streaming partitions. In Proceedings
of the Twenty-Fourth ACM Symposium on Oper-
ating Systems Principles, pages 472–488. ACM,
2013.
[50] Dipanjan Sengupta, Shuaiwen Leon Song, Kapil
Agarwal, and Karsten Schwan. GraphReduce:
processing large-scale graphs on accelerator-based
systems. In High Performance Computing,
Networking, Storage and Analysis, 2015 SC-
International Conference for, pages 1–12. IEEE,
2015.
[51] Bin Shao, Haixun Wang, and Yatao Li. Trinity: A
distributed graph engine on a memory cloud. In
Proceedings of International Conference on Man-
agement of Data (SIGMOD), pages 505–516, 2013.
[52] Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen,
and Feifei Li. Fast and Concurrent RDF Queries
with RDMA-Based Distributed Graph Exploration.
In 12th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 16), pages
317–332, 2016.
[53] Julian Shun and Guy E Blelloch. Ligra: a
lightweight graph processing framework for shared
memory. In PPoPP, 2013.
[54] George M Slota, Sivasankaran Rajamanickam, and
Kamesh Madduri. BFS and Coloring-based Parallel
Algorithms for Strongly Connected Components
and Related Problems. In International Parallel
and Distributed Processing Symposium (IPDPS),
2014.
[55] SNAP: Stanford Large Network Dataset Collection.
http://snap.stanford.edu/data/.
[56] Tyler Sorensen, Alastair F Donaldson, Mark Batty,
Ganesh Gopalakrishnan, and Zvonimir Rakamarić.
Portable inter-workgroup barrier synchronisation
for GPUs. In ACM SIGPLAN Notices, volume 51,
pages 39–58. ACM, 2016.
[57] The University of Florida: Sparse Matrix Collec-
tion. http://www.cise.ufl.edu/research/
sparse/matrices/.
[58] Yuanyuan Tian, Andrey Balmin, Severin Andreas
Corsten, Shirish Tatikonda, and John McPherson.
From Think Like a Vertex to Think Like a Graph.
Proceedings of the VLDB Endowment, 2013.
[59] Stanley Tzeng, Anjul Patney, and John D Owens.
Task Management for Irregular-Parallel Workloads
on the GPU. In Proceedings of the Conference on
High Performance Graphics. Eurographics Associ-
ation, 2010.
[60] Mohamed Wahib and Naoya Maruyama. Scalable
Kernel Fusion for Memory-bound GPU applica-
tions. In Proceedings of the International Confer-
ence for High Performance Computing, Network-
ing, Storage and Analysis. IEEE Press, 2014.
[61] Kai Wang and Zhendong Su. GraphQ: Graph
Query Processing with Abstraction Refinement-
Scalable and Programmable Analytics over Very
Large Graphs on a Single PC.
[62] Siyuan Wang, Chang Lou, Rong Chen, and
Haibo Chen. Fast and Concurrent RDF Queries
using RDMA-assisted GPU Graph Exploration.
In 2018 USENIX Annual Technical Conference
(USENIX ATC 18), Boston, MA, 2018. USENIX
Association.
[63] Yangzihao Wang, Andrew Davidson, Yuechao Pan,
Yuduo Wu, Andy Riffel, and John D Owens. Gun-
rock: A high-performance graph processing library
on the GPU. In Proceedings of the 20th ACM
SIGPLAN Symposium on Principles and Practice
of Parallel Programming, pages 265–266. ACM,
2015.
[64] Ming Wu, Fan Yang, Jilong Xue, Wencong Xiao,
Youshan Miao, Lan Wei, Haoxiang Lin, Yafei Dai,
and Lidong Zhou. GraM: scaling graph computation
to the trillions. In Proceedings of the Sixth
ACM Symposium on Cloud Computing, pages 408–
421. ACM, 2015.
[65] Shucai Xiao and Wu-chun Feng. Inter-block GPU
communication via fast barrier synchronization. In
International Symposium on Parallel & Distributed
Processing (IPDPS), pages 1–12, 2010.
[66] Chenning Xie, Rong Chen, Haibing Guan, Binyu
Zang, and Haibo Chen. Sync or async: Time
to fuse for distributed graph-parallel computation.
In ACM SIGPLAN Notices (PPoPP), volume 50,
pages 194–204. ACM, 2015.
[67] Shengen Yan, Guoping Long, and Yunquan Zhang.
StreamScan: fast scan algorithms for GPUs without
global barrier synchronization. In PPoPP, 2013.
[68] Kaiyuan Zhang, Rong Chen, and Haibo Chen.
NUMA-aware graph-structured analytics. ACM
SIGPLAN Notices (PPoPP), 50(8):183–193, 2015.
[69] Mingxing Zhang, Yongwei Wu, Kang Chen, Xue-
hai Qian, Xue Li, and Weimin Zheng. Exploring the
Hidden Dimension in Graph Processing. In OSDI,
pages 285–300, 2016.
[70] Mingxing Zhang, Yongwei Wu, Youwei Zhuo,
Xuehai Qian, Chengying Huan, and Kang Chen.
Wonderland: A Novel Abstraction-Based Out-Of-
Core Graph Processing System. In Proceedings
of the Twenty-Third International Conference on
Architectural Support for Programming Languages
and Operating Systems, pages 608–621, 2018.
[71] Mingxing Zhang, Youwei Zhuo, Chao Wang,
Mingyu Gao, Yongwei Wu, Kang Chen, Christos
Kozyrakis, and Xuehai Qian. GraphP: Reducing
Communication for PIM-Based Graph Processing
with Efficient Data Partition. In High Performance
Computer Architecture (HPCA), 2018 IEEE Inter-
national Symposium on, pages 544–557, 2018.
[72] Yanfeng Zhang, Qixin Gao, Lixin Gao, and
Cuirong Wang. Maiter: An Asynchronous Graph
Processing Framework for Delta-based Accumula-
tive Iterative Computation. IEEE Transactions on
Parallel and Distributed Systems, 2014.
[73] Yunhao Zhang, Rong Chen, and Haibo Chen. Sub-
millisecond Stateful Stream Querying over Fast-
evolving Linked Data. In Proceedings of the
26th Symposium on Operating Systems Principles
(SOSP), pages 614–630. ACM, 2017.
[74] Yunming Zhang, Vladimir Kiriansky, Charith
Mendis, Saman Amarasinghe, and Matei Zaharia.
Making caches work for graph analytics. In 2017
IEEE International Conference on Big Data (Big
Data), pages 293–302. IEEE, 2017.
[75] Da Zheng, Disa Mhembere, Randal Burns, Joshua
Vogelstein, Carey E Priebe, and Alexander S Sza-
lay. FlashGraph: processing billion-node graphs
on an array of commodity SSDs. In Proceedings of
the 13th USENIX Conference on File and Storage
Technologies, pages 45–58. USENIX Association,
2015.
[76] Jianlong Zhong and Bingsheng He. Medusa:
Simplified graph processing on GPUs. Parallel
and Distributed Systems, IEEE Transactions on,
25(6):1543–1552, 2014.
[77] Xiaowei Zhu, Wentao Han, and Wenguang Chen.
GridGraph: Large-Scale Graph Processing on a
Single Machine Using 2-Level Hierarchical Parti-
tioning. In 2015 USENIX Annual Technical Confer-
ence (USENIX ATC 15), pages 375–386. USENIX
Association, 2015.
