GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms by Engelhardt, Nina & So, Hayden K. -H.
GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms
NINA ENGELHARDT and HAYDEN K.-H. SO, University of Hong Kong
Due to the irregular nature of connections in most graph datasets, partitioning graph analysis algorithms across multiple computational
nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that
FPGAs can outcompete software-based graph processing in shared memory contexts, but it remains an open question if this advantage
can be maintained in distributed systems.
In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators
for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework
automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to
present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can
reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model
based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With
a throughput of up to 5.8 GTEPS (billions of traversed edges per second) on a 4-FPGA system, the designs generated by GraVF-M
compare favorably to state-of-the-art frameworks from the literature and reach 94% of the projected performance limit of the system.
CCS Concepts: •Computer systems organization→ Reconfigurable computing; •Hardware→ Hardware accelerators;
Reconfigurable logic applications;
Additional Key Words and Phrases: Vertex Centric, Graph Processing, Multi-FPGA Architecture, Performance Modelling, FPGA,
GraVF-M
ACM Reference format:
Nina Engelhardt and Hayden K.-H. So. 2019. GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms. ACM Trans.
Recong. Technol. Syst. 12, 4, Article ? (September 2019), 29 pages.
DOI:
1 INTRODUCTION
Graph data structures, which represent connections or relations between entities, are a natural fit for many data analysis
applications. Popular uses range from determining social network influence drivers to finding the shortest route across
a map of roads. However, graph structures, and the algorithms that work on them, have unique characteristics that
present challenges:
Low computational density. Most graph algorithms only execute a few operations per vertex or edge. Generally,
the performance of a single processing node strongly depends on the speed at which graph data can be accessed.
Irregular structure. Exacerbating the previous point, traversing a graph leads to random data access patterns
with very low locality, which significantly degrades DRAM performance.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Manuscript submitted to ACM
arXiv:1910.07408v1 [cs.DC] 14 Oct 2019
High degree of interconnection. When partitioning a graph among multiple nodes, many edges will cross
partitions, leading to high communication between nodes. While a good partitioning algorithm can mitigate
this aspect somewhat, the communication bandwidth required can quickly rival the memory bandwidth and
limit scaling of distributed graph algorithms.
FPGA-based accelerators can address these factors. Their on-chip BRAMs offer high-bandwidth low-latency random
access performance, and their programmable fabric can implement custom architectures that efficiently use the available
resources. Past research includes several specialized FPGA implementations of individual algorithms demonstrating
the efficiencies that can be gained by approaches such as using BRAM to cache the most frequently accessed vertex data
(Zhang and Li 2018) or exploiting high-bandwidth off-chip memory interfaces (Attia et al. 2015).
However, FPGAs are notoriously challenging to program. In addition to the domain knowledge required to be able
to formulate the problem as an effective graph application, an FPGA implementation demands architecture and systems
proficiency. Implementing a dedicated FPGA solution for a single, static application is therefore not practical in many
contexts. Graph processing frameworks can encapsulate all the architecture and systems optimizations, putting FPGA
systems within the reach of a domain expert with only basic hardware programming knowledge.
Research has explored approaches for more generic frameworks to accelerate a wider selection of graph algorithms.
These leverage techniques ranging from transactional memory (Ma et al. 2017), through custom soft processors (Kapre 2015), to
CGRA overlays (Zhou et al. 2017). However, they mostly focus on single-FPGA accelerators, or on their integration
with the software running on the host CPUs. Only a few works consider distributing a single computation across
multiple FPGA boards. The popularity of the Convey HC-1/HC-2 (comprising four FPGAs) at the beginning of the
decade prompted a number of works implementing some specific algorithm on this platform (Betkaoui et al. 2011;
Computer 2011), but these primarily used the high-bandwidth shared memory for communication. There has been
some investigation into the issues for true distributed multi-FPGA frameworks (Dai et al. 2017; deLorimier et al. 2006),
but these have not yet been put to the test in an actual networked implementation.
It is an open question whether there is any performance advantage to be gained by scaling graph algorithms across
multiple FPGAs. Due to the highly interconnected nature of graphs, distributed graph processing requires large
communication volume between partitions. In our previous work, we have been investigating automatic generation of
graph processing FPGA designs, and how we can maximize performance while minimizing the required user input
(Engelhardt et al. 2018; Engelhardt and So 2016, 2017). A performance analysis revealed that this work is efficient on
single-FPGA systems but suffers from communication inefficiency when using lower-bandwidth off-chip networks.
This work introduces GraVF-M, a variant of GraVF optimized for multi-FPGA platforms. It improves on all aspects
of the previous work by offering the following new contributions:
• A reorganized architecture that significantly reduces communication over the inter-FPGA network. As network
bandwidth is the critical factor limiting multi-FPGA performance on most platforms, this directly corresponds
to an equivalent speedup in total system performance.
• A three-stage programming model, which enables this optimization, as well as offering flexibility to the user
and enabling simplified synchronization between supersteps.
• Better load balancing among PEs and FPGAs through low-overhead partitioning heuristics.
• An updated performance model and analysis whose predictions closely match the observed results.
Organization of this paper: Section 2 situates our work in the context of previous efforts on FPGA-based graph
accelerators. Section 3 explains the programming model for the user kernels and section 4 details our framework’s
implementation. Section 5 describes how we model system performance. Section 6 evaluates a concrete implementation
of our system on a 4-FPGA platform and compares it to the predictions of our model and to other works. Finally, the
conclusion is in section 7.
2 RELATED WORK
Previous research on graph processing on FPGAs can be broadly categorized into two groups: efforts exploring an
individual algorithm or platform, and efforts to build frameworks to more generally facilitate the implementation of
graph algorithms.
Implementations of specic algorithms abound (Aia et al. 2014; Betkaoui et al. 2012; Computer 2012; Dai et al. 2016;
Khoram et al. 2018; Lei et al. 2015; Umuroglu et al. 2015; Wang et al. 2010; Zhang et al. 2017; Zhang and Li 2018), but it
is rare that an application is at the same time so business-essential and yet its requirements so slowly changing as to
merit the several months of eort to build a dedicated hardware implementation.
Ours is not the rst framework proposed to ease the implementation of graph processing on FPGAs. We have
identied the following previous work introducing tools for FPGA-based graph processing not limited to a single
application:
2.1 Pure FPGA processing
TuNao (Zhou et al. 2017) is an ECGRA-based accelerator whose main optimization consists in storing the highest-degree
vertices on-chip while leaving less highly connected vertices in off-chip memory. The accelerator targets a 3-phase
Gather-Apply-Scatter execution model with a single data stream; parallelism is exploited by pipelining. The user kernel
for each phase is implemented by a separate ECGRA module.
(Ma et al. 2017) propose a transactional memory-based approach. Their model is GPU-like in that each processing
element hosts a large number of threads that are frequently stalled due to high-latency shared memory accesses.
However, as the SIMD model is not effective for graph applications, these processing elements only process a single
work item (e.g. a vertex kernel) at a time. Communication between different vertices happens via shared transactional
memory, on the basis that most accesses will not conflict.
(Zhou et al. 2016, 2018) argue that edge-centric graph processing is superior to the vertex-centric paradigm; they
build edge-centric accelerators. The first design streams directly from off-chip memory; the second partitions the graph into
multiple sections which can be processed in parallel by multiple PEs. Both designs rely on strict sorting of the edge
list for optimizations of the bandwidth needed to deliver messages.
2.2 FPGA processing with access to host memory
GraphGen (Nurvitadhi et al. 2014) is a graph accelerator built on the CoRAM memory interface, which gives the FPGA
access to the host’s main memory. The accelerator is a single processing element that uses pipelining and SIMD
processing to extract parallelism from the application. This is sufficient because of the low bandwidth connection to the
main memory, but for systems where the graph data can be accessed more efficiently, the irregularity of the graph
interferes with SIMD approaches.
GraphSoC (Kapre 2015) is a custom soft processor-based approach where a vertex-centric algorithm kernel is
expressed in 4 phases, and each kernel phase is implemented as a custom instruction. Thus the processors always
execute the same program, but the effect of the instructions is changed. The processors exchange messages through an
on-chip network, which also serves to access graph data from the host’s main memory.
2.3 CPU-FPGA hybrid processing
GraphOps (Oguntebi and Olukotun 2016) proposes a library of dataflow actors for graph processing. They also propose a
novel data structure that directly stores the vertex data of neighbors rather than an edgelist (which is a list of references
to the neighbors). However, this requires the host to re-create this data structure for each iteration, and also increases
the size of the graph storage.
(Zhou and Prasanna 2017) propose a hybrid accelerator that is able to switch between vertex-centric and edge-centric
execution depending on the current workload. The graph is partitioned in intervals and the CPU and FPGA both process
different intervals in parallel (the FPGA with greater throughput than the CPU).
ExtraV (Lee et al. 2017) considers the case where the edge data is too large to fit into the host’s main memory and has
to be accessed from disk, but the vertex data is able to reside in DRAM. It uses the FPGA as an enhanced disk controller
with near-data processing capabilities. The FPGA can read the edge data from disk in a compressed format, decompress
and filter it on the fly and send the resulting list of neighbors to the host. The host then performs all algorithm-specific
computation.
2.4 Multi-FPGA capable frameworks
GraphStep (Delorimier et al. 2011; deLorimier et al. 2006) was an early proponent of using the high internal memory
bandwidth delivered by Block RAM to accelerate graph processing on FPGAs. They also proposed connecting multiple
FPGAs together to increase the size of graph that can be processed. However, FPGA devices at the time offered even
lower amounts of embedded memory than today, so their approach of storing the entire graph structure in BRAM
resulted in (simulated) systems with hundreds of FPGAs for even modestly large graph datasets.
ForeGraph (Dai et al. 2017) is a multi-FPGA system targeting the Microsoft Catapult platform. It also consists of a
number of PEs that communicate via a network.
Both of these frameworks’ multi-FPGA performance is only evaluated in simulation. In contrast, our framework is
fully implemented and performance numbers are actually observed.
3 PROGRAMMING MODEL
Based on a short user description, the GraVF-M framework generates a hardware description of a graph application for
synthesis on platforms comprising one or more FPGAs as well as (optionally) off-chip memory.
The user input to the framework is a graph algorithm. The programming model used for this input has to fulfill
two aims: to enhance productivity through ease of use, and to expose the algorithm features in a manner suited to
hardware implementation. We chose a synchronous vertex-centric programming model, and added some constraints
related to data accesses to ensure a uniform interface for the resulting hardware kernel. Compared to our previous
work (Engelhardt and So 2016), the main difference is the addition of a gather stage, which aligns our programming
model with the familiar gather-apply-scatter paradigm, and makes it easier for the programmer to ascertain that the
mentioned constraints are respected.
In a synchronous vertex-centric programming model, the algorithm is formulated as a small function called a
vertex kernel. is user-provided kernel is run concurrently for each active vertex in the graph in what is called a
GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms 5
superstep. During a superstep, a vertex kernel only has access to limited data local to the vertex. Data is exchanged with
neighboring vertices through messages. Supersteps are separated by global barrier synchronization, and any messages
sent in superstep n will only be received in superstep n + 1.
This type of programming model is widely used as it is straightforward to understand, and any algorithm formulated
in this manner is inherently embarrassingly parallel. However, it presents two main challenges due to the global barrier:
The first is that imbalance in workload leads to time wasted waiting for stragglers to reach the barrier. This is mainly
addressed through implementation choices that reduce imbalance and further mitigated by the use of a floating barrier.
As these do not require modifications to the programming model, they will be discussed in the architecture section.
The second challenge is that all messages generated in a superstep need to be stored until the next superstep. Memory
resources are very limited on FPGA environments: typically, there will be less than 40 MB of on-chip BRAM and some
4-16 GB of off-chip DDR available.
To ensure correct and deadlock-free completion of algorithms, given the fixed-size buffers available, the quantity of
message data generated in a superstep needs to be strictly bounded. We design the GraVF-M programming model so as
to intrinsically enforce these limits, by asking the user to split their kernel into three functions (hardware modules)
with fixed I/O interfaces.
Gather. is function is called once for each message received by a vertex. It updates the vertex state based on the
contents of the message.
Apply. is function is called once for each vertex at the end of a superstep, aer all messages have been received.
Based on the nal vertex state, it can issue an update to be communicated to the vertex’s neighbors.
Scaer. is function is called once for every outgoing neighbor. Based on the update issued by apply and, if dened,
the edge data (weight), it nalizes the message to be sent to the neighbor.
Because the apply step is only called once for each vertex per superstep, the number of updates per superstep cannot
exceed the number of vertices. Rather than attempt to store the much greater number of messages, we store these
updates and only call the scatter function to produce the actual messages on demand when resources are available during
execution of the next superstep. The messages are then generated, transported to their destination, and immediately
consumed by the gather stage of the receiving vertex without risk of deadlock.
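The interplay of the three functions can be sketched in software. The following Python model is an illustration only and not the generated hardware: the function name run_supersteps, the dictionary-based graph representation, and the breadth-first search example kernels are assumptions made for this sketch. It stores at most one update per vertex between supersteps and produces each message on demand, immediately before the gather call that consumes it.

```python
INF = float("inf")

def run_supersteps(out_edges, state, gather, apply, scatter, initial_updates):
    """Run supersteps until no vertex issues an update.

    out_edges: dict vertex -> list of (neighbor, edge_data)
    state: dict vertex -> vertex state (modified in place)
    initial_updates: dict vertex -> update payload seeding the first superstep
    Returns (number of supersteps, peak number of stored updates).
    """
    updates = dict(initial_updates)  # at most one stored update per vertex
    supersteps = 0
    peak_updates = len(updates)
    while updates:
        # Scatter + gather: each message is produced from a stored update
        # and consumed immediately, so messages are never buffered.
        for sender, update in updates.items():
            for neighbor, edge_data in out_edges.get(sender, []):
                message = scatter(update, sender, neighbor, edge_data)
                state[neighbor] = gather(state[neighbor], message, sender)
        # Apply: called once per vertex after all messages were received.
        updates = {}
        for vertex in state:
            state[vertex], update = apply(vertex, state[vertex])
            if update is not None:
                updates[vertex] = update
        peak_updates = max(peak_updates, len(updates))
        supersteps += 1
    return supersteps, peak_updates

# Hypothetical example kernels: BFS distance, state = (distance, active).
def bfs_gather(state, message, sender):
    dist, active = state
    return (message, True) if message < dist else (dist, active)

def bfs_apply(vertex, state):
    dist, active = state
    return (dist, False), (dist if active else None)

def bfs_scatter(update, sender, neighbor, edge_data):
    return update + 1

out_edges = {0: [(1, None), (2, None)], 1: [(3, None)],
             2: [(3, None)], 3: [(4, None)], 4: []}
state = {v: (INF, False) for v in out_edges}
state[0] = (0, False)
supersteps, peak_updates = run_supersteps(
    out_edges, state, bfs_gather, bfs_apply, bfs_scatter, {0: 0})
```

In this model the only storage that persists across the barrier is the updates dictionary, whose size can never exceed the number of vertices.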
As an example, we show how the weakly connected components (WCC) algorithm is implemented in GraVF-M.
This algorithm detects disjoint areas of a graph (sets of vertices such that all edges originating from a vertex in the set
connect to a vertex also in the same set). These sets are also known as weakly connected components (to differentiate
from strongly connected components, where every vertex in the set can be reached from every other vertex in the set
by following edge directions). Each set agrees on a common label by propagating the lowest seen vertex ID to all neighbors.
The GraVF-M framework itself is written in Migen (M-Labs 2012), a Python-based hardware description language
that exports to Verilog for synthesis with conventional FPGA vendor tools such as Vivado from Xilinx. The native
way to implement the algorithm kernels is also in Migen, but glue code is provided to insert any netlist that the FPGA
vendor’s tool accepts, thus the user is free to choose their favorite hardware description language.
In a rst step, the user needs to dene the algorithm-dependent data structures. ese are dened in Migen/Python,
as a list of eld names and their bit widths. Predened values related to system conguration, such as the size of
a vertex identier vertexidsize, can be used. e user can also dene their own identiers, whose values have to be
provided in the conguration le at build. e following data structures need to be dened:
6 Engelhardt, So
(1) Vertex state data. For WCC, each vertex saves the smallest vertex ID found so far. The field active indicates if a
smaller label was found in this superstep that should be broadcast to neighbors.
node_storage_layout = [
    ("label", "vertexidsize"),
    ("active", 1)
]
(2) Edge data (optional). WCC does not consider edge weights, so this definition is omitted.
(3) Update payload. When a smaller vertex ID is found, the information that the apply kernel needs to pass to the
scatter kernel is the new label.
update_layout = [
    ("label", "vertexidsize")
]
(4) Message payload. The information sent by the scatter kernel to each neighbor is the new, smaller vertex ID
encountered.
message_layout = [
    ("label", "vertexidsize")
]
It frequently happens, as here, that the update payload and the message payload share the same format. In
this case it is not necessary to write out the full definition again. These are simple Python variable definitions,
so previously defined variables can be used:
message_layout = update_layout
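To make the role of symbolic widths such as vertexidsize concrete, the following sketch resolves the WCC layouts against a configuration dictionary and sums the resulting bit widths. The helper names resolve_layout and layout_width, the dictionary form of the configuration, and the value 32 are assumptions of this example; the source does not show how the framework performs this step internally.

```python
def resolve_layout(layout, config):
    """Replace symbolic widths (strings) by integers looked up in config."""
    resolved = []
    for name, width in layout:
        if isinstance(width, str):
            width = config[width]  # e.g. "vertexidsize" -> 32
        resolved.append((name, int(width)))
    return resolved

def layout_width(layout, config):
    """Total number of bits needed to store one record of this layout."""
    return sum(width for _, width in resolve_layout(layout, config))

# Hypothetical configuration value; the real one comes from the build setup.
config = {"vertexidsize": 32}

node_storage_layout = [("label", "vertexidsize"), ("active", 1)]
update_layout = [("label", "vertexidsize")]
message_layout = update_layout  # shared format, as described above

state_bits = layout_width(node_storage_layout, config)
message_bits = layout_width(message_layout, config)
```

With these assumed values, one vertex state record occupies 33 bits and one message payload 32 bits.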
In a second step, the three kernel modules Gather, Apply, and Scatter have to be defined, this time in the hardware
description language of the user’s choice. This example will use Verilog kernels, as more readers are likely to be familiar
with this language than with Migen.
The data formats defined in the previous step are used to set the input and output signals of each module. All modules
may be pipelined as deeply as the programmer wishes or have variable latency; they use handshake signals for flexible
flow control.
Gather. e gather kernel module receives as inputs a message from a neighbor that discovered a new label in the
last superstep, and the destination vertex’s state. e following inputs are dened:
level_in current superstep
nodeid_in active vertex’s ID
sender_in neighbor that sent the received message
message_in received message (using message_layout)
state_in active vertex’s current state data (using node_storage_layout)
Flow control is handled by the associated handshake signals valid_in and ready.
The module should generate the following outputs:
nodeid_out active vertex’s ID
state_out active vertex’s new state data
with the associated handshake signals state_valid and state_ack.
wire new_label;
assign new_label = state_in_label > message_in_label;
assign nodeid_out = nodeid_in;
assign state_out_label = new_label ? message_in_label : state_in_label;
assign state_out_active = new_label ? 1'b1 : state_in_active;
assign state_valid = valid_in;
assign ready = state_ack;
Listing 1. WCC gather implementation
Listing 1 shows the implementation of the WCC gather kernel module. The amount of logic needed is very small, so
it can be implemented using combinatorial logic only. A single comparison is needed: if the label from the message is
smaller than the currently found smallest label, the label is updated in the vertex state and the vertex is marked active.
Otherwise, the state is not modied.
Apply. e apply kernel module processes each vertex’s state at the end of the superstep, and initiates a broadcast of
the new label if a smaller one was found this superstep. e following inputs are dened:
nodeid_in active vertex
state_in active vertex’s current state data (using node_storage_layout)
round_in channel to use for messages sent in this superstep
barrier_in barrier marker (asserted after the last active vertex in the superstep)
Flow control is handled by the associated handshake signals valid_in and ready.
This module should generate two sets of outputs:
nodeid_out active vertex
state_out active vertex’s new state data (using node_storage_layout)
state_barrier barrier marker (indicates all state writebacks of a superstep are completed)
state_valid signals current output should be written back (no handshake)
There is no handshake for this output. To avoid the added complexity of handling two sources of backpressure, it is
guaranteed that the state can be written back at any time.
update_out update to be broadcast (using update_layout)
update_sender vertex that generated the update
update_round channel to use for sending messages generated from the update
barrier_out barrier marker (separates updates of different supersteps)
Flow control is handled by the associated handshake signals update_valid and update_ack.
Listing 2 shows the implementation of the WCC apply kernel module. Again, the logic is minimal: If the vertex has
been marked active in the gather step, meaning that a smaller label has been found in this superstep, then an update is
issued by asserting update_valid. All vertices also have their active bit reset to zero in preparation for the next superstep.
assign nodeid_out = nodeid_in;
assign state_out_label = state_in_label;
assign state_out_active = 1'b0;
assign state_barrier = barrier_in;
assign state_valid = valid_in;
assign update_out_label = state_in_label;
assign update_sender = nodeid_in;
assign update_round = round_in;
assign update_valid = valid_in & state_in_active;
assign barrier_out = barrier_in;
assign ready = update_ack;
Listing 2. WCC apply implementation
Scaer. e scaer kernel module processes the outgoing edges of a vertex that issued an update in the apply phase.
For each edge, it propagates the label from the update.
e following inputs correspond directly to the update outputs from the apply kernel module (the same update is
repeated multiple times for the dierent outgoing edges):
update_in update to be broadcast (using update_layout)
sender_in vertex that generated the update
round_in channel to use for sending messages generated from the update
barrier_in barrier marker (separates updates of different supersteps)
Furthermore, the following inputs relate to each edge:
neighbor_in destination of the current edge
num_neighbors_in out-degree of the sending node
Input handshake is accomplished by the signals valid_in and ready.
The module outputs messages, using the following signals:
message_out message (using message_layout)
neighbor_out destination vertex
sender_out sending vertex
round_out channel to use for sending the message
barrier_out barrier marker (separates messages of different supersteps)
Output handshake signals are valid_out and message_ack.
Listing 3 shows the scatter kernel module for WCC. All values are forwarded as-is. While it is not necessary for
these small amounts of logic, we insert a pipeline stage to demonstrate how a pipelined module would be implemented.
assign ready = message_ack;
always @(posedge sys_clk) begin
if (message_ack) begin
message_out_label <= update_in_label;
neighbor_out <= neighbor_in;
sender_out <= sender_in;
round_out <= round_in;
valid_out <= valid_in;
barrier_out <= barrier_in;
end
end
Listing 3. WCC scatter implementation
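For readers who prefer software notation, the data path of Listings 1 to 3 can be mirrored as three small Python functions; handshake, barrier, and round signals are left out. The driver function wcc is hypothetical, and two of its choices are assumptions of this sketch rather than statements from the text: every vertex starts active with its own ID as label, and every edge is stored in both directions so that labels can travel against the edge direction.

```python
def gather(state, message):
    """Listing 1: keep the smaller label and mark the vertex active."""
    new_label = state["label"] > message["label"]
    return {
        "label": message["label"] if new_label else state["label"],
        "active": 1 if new_label else state["active"],
    }

def apply(state):
    """Listing 2: issue an update if active; always clear the active bit."""
    update = {"label": state["label"]} if state["active"] else None
    return {"label": state["label"], "active": 0}, update

def scatter(update):
    """Listing 3: forward the update payload unchanged as the message."""
    return {"label": update["label"]}

def wcc(num_vertices, edges):
    """Label each vertex with the smallest vertex ID in its component."""
    neighbors = {v: [] for v in range(num_vertices)}
    for src, dst in edges:
        neighbors[src].append(dst)  # assumption: edges stored both ways
        neighbors[dst].append(src)
    # Assumption: initial label is the vertex's own ID, all vertices active.
    state = {v: {"label": v, "active": 1} for v in range(num_vertices)}
    while True:
        updates = {}
        for v in state:  # apply phase, once per vertex
            state[v], update = apply(state[v])
            if update is not None:
                updates[v] = update
        if not updates:
            break
        for sender, update in updates.items():  # scatter, then gather
            for n in neighbors[sender]:
                state[n] = gather(state[n], scatter(update))
    return [state[v]["label"] for v in range(num_vertices)]

labels = wcc(6, [(0, 1), (1, 2), (3, 4)])
```

On this six-vertex example, vertices 0 to 2 end up with label 0, vertices 3 and 4 with label 3, and the isolated vertex 5 keeps its own ID.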
Cumulatively, the whole hardware description of the WCC algorithm is implemented using approximately 30 lines
of code. is is typical of most kernels, as long as they only use integer arithmetic. An algorithm with multiple
oating-point operations, such as e.g. PageRank, requires about 200 lines of code to implement the pipelining necessary
for ecient processing.
Type of algorithms supported. Not all graph algorithms can be implemented in the Gather-Apply-Scatter programming
model chosen. An interesting way to view the restrictions is by considering the matrix representation of graphs: Instead
of representing a graph as a collection of vertices and edges, it is also possible to represent a graph as an adjacency
matrix of size $|V| \times |V|$, where the element $A_{ij} = 1$ if the vertices $i$ and $j$ are connected by an edge, and $A_{ij} = 0$ otherwise.
The data accesses performed in this Gather-Apply-Scatter model are the same as for multiplying the vector of vertex
states with the adjacency matrix, and saving the result as the new vertex state.
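The matrix view can be illustrated with a small sketch in which one WCC superstep is written as a matrix-vector style product. As an illustrative choice of this example, the usual multiply and add are replaced by "pass the label along the edge" and "take the minimum", and every vertex sends its label in every step instead of only the active ones. The function name superstep_as_matvec is hypothetical.

```python
def superstep_as_matvec(adjacency, labels):
    """Compute new[j] = min(labels[j], min of labels[i] where A[i][j] == 1).

    The access pattern is that of a vector-matrix product: every non-zero
    entry A[i][j] is visited once and combines labels[i] into result[j].
    """
    n = len(labels)
    result = list(labels)
    for j in range(n):
        for i in range(n):
            if adjacency[i][j] == 1:
                result[j] = min(result[j], labels[i])
    return result

# Path graph 0 - 1 - 2, stored symmetrically.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
step1 = superstep_as_matvec(adjacency, [0, 1, 2])
step2 = superstep_as_matvec(adjacency, step1)
```

The result always has exactly one entry per vertex, however many steps are taken.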
In terms of the equivalent matrix representation of graphs, the algorithms that can be implemented are of matrix-
vector type: throughout all supersteps, the resulting vector of vertex data remains of constant size. Operations of type
matrix-matrix, or in graph terms operations that build lists of multi-hop paths, are deliberately excluded since they have
dynamically increasing storage requirements (the result of multiplying a sparse matrix with itself has more non-zero
values than the original). Although there is a theoretical transformation that allows any vertex-centric algorithm to be
turned into a 3-phase Gather-Apply-Scatter kernel, these transformations rely on non-bounded data structures (e.g.,
keeping all messages received in the Gather stage saved in the vertex data, and only using them in the Apply stage).
These cannot be realistically applied on FPGA, which has much more limited and less flexible memory resources.
4 ARCHITECTURE
One of the advantages of using a vertex-centric model is that large amounts of parallelism are readily available. To
make use of it, the GraVF-M framework generates as many processing elements (PEs) as can fit the available resources.
The vertices of the graph are then partitioned among the PEs. The assigned PE of a vertex hosts the vertex’s data and
runs the vertex kernel for this vertex. Communication between vertices involves transfer of data between different PEs.
An on-chip network, in our case realized as a crossbar, transports data between PEs on the same FPGA. On multi-FPGA
systems, data destined for remote PEs is handed to a network endpoint that delivers it via off-chip network to the
endpoint on the destination FPGA. Fig. 1 shows a high-level overview of the complete system.
4.1 Optimized message delivery for multi-FPGA systems
When implementing a message-passing graph processing system, the number of messages to be delivered is a concern
in two ways:
• Storage space. Because computation is structured into supersteps, all messages sent during a superstep have to
be stored before they can be delivered to the next superstep. The available memory space puts an upper limit on
the number of messages that can be sent during a superstep, which generally correlates with the dataset size.
• Communication bandwidth. Graphs are highly interconnected, usually in an unstructured manner, which
means that partitioning the graph between multiple processing elements or across multiple FPGAs will lead to
large numbers of messages being sent between partitions.
To address both issues, we introduce an optimization in GraVF-M which makes use of the observation that during the
scatter phase, one update produces many messages. We can save both memory space and bandwidth by waiting until
the last possible moment to call the scatter function, which means producing messages on demand to be immediately
consumed by the gather module.
Fig. 1. System overview (FPGA 1 to FPGA n, each containing multiple PEs and one network endpoint; the endpoints attach to the network)
From the user’s perspective, the programming model is unchanged: they write the program as if messages were being
exchanged by distinct supersteps. In the underlying hardware generated by the framework, however, the execution of
the scaer stage is delayed until the following superstep, and it is located in the receiving PE.
The sending PE stores all updates produced by the apply stage during a superstep in a large output queue. Because
the programming model is defined such that at most one update is allowed to be issued by each vertex in a superstep,
the size of this queue is bounded by the number of vertices, which is generally at least an order of magnitude smaller than
the number of messages, of which there can be as many as there are outgoing edges. Rather than creating individual
messages and transporting them to their destination, the update is broadcast to all PEs, which then execute the scatter
stage for the subset of edges that have local destinations.
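The size argument can be checked with a few lines of arithmetic. The sketch below compares the worst-case number of queue entries when updates are stored (one per vertex) with the number needed if messages were stored (one per outgoing edge); their ratio is the average out-degree. The function names and the adjacency-list dictionary are assumptions of this example.

```python
def worst_case_queue_entries(out_edges):
    """Return (updates, messages) if every vertex issues an update.

    out_edges: dict vertex -> list of neighbor vertices
    """
    num_updates = len(out_edges)  # at most one update per vertex
    num_messages = sum(len(neighbors) for neighbors in out_edges.values())
    return num_updates, num_messages

def average_out_degree(out_edges):
    """Ratio of stored messages to stored updates in the worst case."""
    num_updates, num_messages = worst_case_queue_entries(out_edges)
    return num_messages / num_updates

# Small hypothetical graph with 4 vertices and 8 directed edges.
out_edges = {0: [1, 2, 3], 1: [0, 2], 2: [0, 3], 3: [0]}
entries = worst_case_queue_entries(out_edges)
ratio = average_out_degree(out_edges)
```

For this toy graph the update queue needs 4 entries where a message queue would need 8.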
4.2 Processing Elements
The processing element is responsible for organizing the execution of the vertex kernels according to the described
programming model. This means arranging the flow of messages, retrieving the associated vertex and edge data, and
managing the update queue, which is the principal storage in the system.
Like the algorithm, the PEs are split into three modules for the three phases gather, apply and scatter. Fig. 2 gives an
overview of the PE’s internal pipeline, with the user-provided kernel modules shaded.
Fig. 2. Multi-FPGA optimized processing element overview
4.2.1 Scatter. Incoming updates from the previous superstep enter the scatter module (Fig. 3). For each update, the
scatter module first looks up the starting address and length of the sender’s local portion of the edge list, then iterates
through the local neighbors and presents the edges one by one to the scatter kernel along with the update. Edgelist
storage can be local (in BRAM) or in off-chip memory, such as HMC. The last update of a superstep is followed by a
special barrier update, which is passed through the scatter kernel to mark the end of the messages.
[Figure omitted; labels: update, index storage, index, edgelist storage, edge, Scatter Kernel, message.]
Fig. 3. Scatter module
Edge storage modications. Delaying execution of the scaer stage necessitates changing how the edge data is stored.
Rather than storing the list of all neighbors for the subset of vertices assigned to this PE, it now stores for all vertices
the subset of neighbors assigned to this PE. e overall volume of edges stored is the same, but they are moved to a
dierent location. Fig. 4 illustrates where edges are stored in the original (GraVF) and the new (GraVF-M) scheme for
the case of 4 PEs, with each color representing the partition of edges stored in the same scaer module.
[Figure omitted; two grids with axes "origin location" and "destination location" (0 to 3), colored by the PE that stores each block of edges.]
Fig. 4. Edge list partitioning in GraVF (left) and GraVF-M (right)
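
The two storage schemes can be contrasted in a short Python sketch. The round-robin ownership rule (`vertex % n_pe`) and both function names are assumptions chosen for illustration; the source does not prescribe them here.

```python
from collections import defaultdict

def partition_by_origin(edges, n_pe):
    """GraVF-style layout: each PE stores all neighbors of the vertices it owns."""
    store = [defaultdict(list) for _ in range(n_pe)]
    for src, dst in edges:
        store[src % n_pe][src].append(dst)   # assumed owner: vertex ID mod n_pe
    return store

def partition_by_destination(edges, n_pe):
    """GraVF-M-style layout: each PE stores, for every vertex, the neighbors it owns."""
    store = [defaultdict(list) for _ in range(n_pe)]
    for src, dst in edges:
        store[dst % n_pe][src].append(dst)   # edge lives where its destination lives
    return store

def stored_edges(store):
    """Total number of edges held across all PEs."""
    return sum(len(nbrs) for pe in store for nbrs in pe.values())

# Hypothetical four-edge graph on two PEs.
example_edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
by_origin = partition_by_origin(example_edges, 2)
by_destination = partition_by_destination(example_edges, 2)
```

Both layouts hold the same number of edges in total; only the PE on which each edge resides changes, and a single vertex may now have a (possibly empty) edge list on every PE.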
4.2.2 Gather. Messages produced by the scatter module enter the gather module, whose internals are shown in Fig.
5. The module reads the destination vertex’s data, and passes it together with the message to the user-defined gather
kernel. A read-after-write hazard may occur if another message for the same vertex arrives in quick succession, and
the destination vertex’s data has not yet been written back. We use a hazard detection module to keep track of in-use
vertex data and stall the read stage until the correct data has been written back by the gather kernel.
[Figure omitted; labels: Network, message, hazard, read, Vertex storage, write, vertex data, Gather Kernel.]
Fig. 5. Gather module
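
The cost of read-after-write stalls can be estimated with a small cycle-counting model in Python. This is a behavioural sketch under stated assumptions, not the hardware hazard detection module: it assumes one read can be issued per cycle and that a vertex is written back a fixed `kernel_latency` cycles after it is read. Both the function name and the latency parameter are hypothetical.

```python
from collections import deque

def gather_cycles(destinations, kernel_latency):
    """Cycles needed to issue all reads when in-use vertices stall the read stage.

    destinations   : destination vertex ID of each message, in arrival order
    kernel_latency : assumed cycles between reading a vertex and its write-back
    """
    in_flight = deque()   # (vertex, cycle at which its write-back completes)
    cycle = 0
    for dst in destinations:
        # Hazard check: is a write-back to this vertex still pending?
        pending = [done for v, done in in_flight if v == dst and done > cycle]
        if pending:
            cycle = max(pending)          # stall until the data is written back
        # Retire entries whose write-back has completed.
        while in_flight and in_flight[0][1] <= cycle:
            in_flight.popleft()
        in_flight.append((dst, cycle + kernel_latency))
        cycle += 1                        # one read issued per cycle
    return cycle

no_hazard = gather_cycles([1, 2, 3, 4], kernel_latency=3)
all_hazard = gather_cycles([1, 1, 1, 1], kernel_latency=3)
```

Four messages to distinct vertices take four cycles in this model, whereas four back-to-back messages to the same vertex take ten, which shows how the message pattern can raise the cycles needed per edge.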
4.2.3 Apply. When a barrier message enters the gather stage, this signals that all messages for the current superstep
have been received. e ready signal from the gather module is deasserted to temporarily halt further transmission of
messages from the next superstep that might potentially already be waiting. e gather module is rst ushed to ensure
that all vertex data is wrien back, then access to the vertex storage is switched to the apply module (Fig. 6). e apply
module iterates through all vertices assigned to the PE, passing them sequentially to the user-dened apply kernel. e
updates generated by the apply kernel are entered into a large update queue. Once all modications to vertex data are
wrien back, a barrier update is wrien to the queue to separate updates of this superstep from those of the next, and
control of the vertex data is switched back to the gather module, which once again begins accepting messages.
[Figure omitted; labels: read, Vertex storage, write, vertex data, Apply Kernel, update.]
Fig. 6. Apply module
4.3 Network and Synchronization
The role of the network is not only to transport data between the PEs, but also to enforce separation between the
supersteps. The programming model specifies a barrier between supersteps: all messages from one superstep should
finish processing before the first message from the next superstep is processed. Given the deep pipelines necessary
to make FPGA-based processing efficient and the distributed nature of the system with many processing elements
advancing at different rates, enforcing such a strict temporal separation between supersteps would waste many cycles
flushing the complete system at each superstep. Instead, we propose the use of a floating barrier to preserve the
invariant, from the user’s perspective, that they will always receive all messages from one superstep before the first
message from the next, but different pipeline steps and different PEs may proceed to the next superstep at different
times.
Floating Barrier. To achieve flexible distributed barrier synchronization, barrier messages are exchanged between
PEs. Initially, the barrier is injected into the apply modules to begin execution. This causes each apply module to call
the apply kernel on the initial state of its assigned vertices, and then follow with the barrier. The barrier progresses
through the pipeline behind the initial updates, terminating the data for this superstep. As soon as the apply module
finishes writing back the modified vertex data of the last vertex, it switches operations back to the gather module,
which is now ready to accept any messages for the next superstep. These are likely to be already waiting, as the earlier
updates have had time to progress through the scatter module and cause messages to be generated. While the barrier
remains in the update queue, the gather module is in effect processing a superstep ahead of the scatter module.
Throughout the PE pipeline, strict ordering is enforced, with the barrier following behind the data for the current
superstep. When the barrier reaches the end of the apply module and exits the PE, the network sends a copy of the
barrier to each PE individually, and includes the count of updates that was sent during this superstep to the destination
PE in the barrier. This count allows the destination to confirm it has received all updates for this superstep, which
is necessary to support networks which do not guarantee in-order delivery (e.g. packet-switched off-chip networks
like Ethernet). If the barrier message were to be reordered and arrive before other messages, the destination network
endpoints would wait for the remaining messages.
On the receiving side, each PE’s network endpoint passes on updates to the gather module, and waits to receive a
barrier from each PE in the system (including itself). Upon reception of all barriers and comparison of received and
expected update counts, the network endpoint of the PE is able to independently determine that it has received all
updates for this superstep. At this point the barrier is passed on to the PE, which can progress to the next superstep.
Detecting algorithm termination is also achieved through the barrier messages. The message includes a bit indicating
whether the sending PE has sent any updates during the last superstep. If none of the PEs have generated any updates,
the algorithm is deemed inactive and the receiving network endpoint does not pass the barrier on but raises a termination
signal instead. This way all PEs are able to detect termination independently, in true distributed fashion.
Network properties. For the floating barrier algorithm to work properly and not cause deadlock, the network needs
to fulfill two conditions: it needs to guarantee delivery (i.e. not lose any updates), and it needs to guarantee that
backpressure on updates of one superstep will not impede delivery of updates from any previous superstep.
The first condition is straightforward: if updates are lost in transit, the algorithm execution is no longer correct.
Deadlock would also occur because the receiving PE would wait indefinitely for the updates to arrive.
The second condition also serves to avoid deadlock. Because the floating barrier allows PEs to progress to the next
superstep at different times, updates from different supersteps may coexist. Buffering capacity in the network is very
limited. Therefore, if some PE advances to the next superstep and starts sending many messages before any of the other
PEs are ready to accept them, these messages would fill all available buffers in the network. If this impedes delivery of
any messages from the previous superstep that the other PEs are still waiting for, they will never be able to progress to
the next superstep. However, the messages from the new superstep clogging the buffers cannot be consumed until after
the other PEs switch to the next superstep. To ensure this deadlock situation cannot occur, the network must not allow
messages from a later superstep to interfere with delivery of earlier messages.
The easiest way to implement this would be to only admit messages of one superstep into the network at a time, and
prohibit sending messages of the next superstep until all messages of the previous superstep are delivered. However, this
reduces the benefit of the floating barrier, as it means that a true temporal barrier is introduced and variable workload
imbalances between PEs can no longer be averaged out across supersteps. For better performance, the network should
offer separate (physical or virtual) channels for messages of different supersteps.
We implement internal communication within the FPGA through a crossbar network. The receiving side indicates
which channel it is currently accepting messages from, and the arbiter selects a sender among PEs wishing to send on
this channel.
For inter-FPGA communication, we sequentialize the virtual channels. The endpoint for outgoing messages in the
crossbar only accepts messages from one superstep at a time, and switches to the next channel when all PEs within the
FPGA have completed the superstep.
Filtering delivery of updates to FPGAs with neighbors. In comparison to GraVF, the main change to the network is
that instead of unicast messages, it now transports broadcast updates that have to be delivered to all PEs in the system.
e local network is modied in such a manner that updates sent from a local PE are broadcast to all local PEs and to
each remote FPGA, whereas updates coming from remote FPGAs are broadcast only to local PEs. Updates, just like
messsages, use channels to separate supersteps and are counted to ensure correct synchronization. Barriers work the
same as previously to detect superstep switchover and algorithm termination.
Broadcasting an update to all PEs is only advantageous if an update, on average, produces more messages than there
are copies made of the update. If most vertices have only one neighbor, sending a copy of the update to each other
FPGA in the system will use more inter-FPGA bandwidth than a single message would. To ensure that this optimization
will always result in a net benet, we add a ltering unit that looks up whether the sender of an update has neighbors
on a remote FPGA, and does not send it to FPGAs where there are no potential recipients. is is currently realized
through a bitmap of size |V | ×nFPGA (listing for each vertex which FPGAs host neighbors) that is generated at the same
time as the graph is partitioned across FPGAs. For larger systems, the storage required for the lter could be reduced
through use of compression or heuristic methods such as bloom lters.
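
A possible construction of such a filter bitmap is sketched below in Python. The function names, the `fpga_of` placement list and the example graph are assumptions for illustration; the source only specifies the bitmap's size and meaning.

```python
def build_filter_bitmap(edges, fpga_of, n_fpga):
    """Return bitmap[v][f] == True if vertex v has a neighbor hosted on FPGA f.

    edges   : list of (src, dst) pairs
    fpga_of : fpga_of[v] is the FPGA hosting vertex v (assumed placement)
    """
    bitmap = [[False] * n_fpga for _ in range(len(fpga_of))]
    for src, dst in edges:
        bitmap[src][fpga_of[dst]] = True
    return bitmap

def remote_targets(bitmap, src, local_fpga):
    """Remote FPGAs that must receive a copy of an update issued by src."""
    return [f for f, has_neighbor in enumerate(bitmap[src])
            if has_neighbor and f != local_fpga]

# Hypothetical placement: six vertices, two per FPGA.
placement = [0, 0, 1, 1, 2, 2]
graph = [(0, 1), (0, 2), (1, 4), (1, 5)]
bitmap = build_filter_bitmap(graph, placement, n_fpga=3)
```

Here an update from vertex 0 is forwarded only to FPGA 1 and an update from vertex 1 only to FPGA 2, so each update crosses the inter-FPGA network once instead of twice.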
4.4 Partitioning
How the vertices are distributed among the PEs affects the workload balance. Better partitioning improves processing
speed; however, the pre-processing overhead can quickly overwhelm these gains. The least-effort way to distribute
vertices between the PEs is by round-robin, which results in a balanced number of vertices per PE. However, as the
workload consists primarily of traversing edges, the cumulative number of outgoing edges from the vertices assigned
to a PE would be a better proxy for workload balance.
Balancing the number of edges per PE is an instance of the multi-way number partitioning problem and thus
theoretically NP-complete; however, this problem is known to be solved well by even the simplest of heuristics if the
number of partitions is much smaller than the number of items to be partitioned (Korf 2009). In fact, we find that we
achieve near-perfect partitioning by assigning vertices to the partition with the currently lowest edge count even if we
omit the preliminary step of sorting the vertices by degree. This is presumably helped by the fact that the vertices are
already somewhat sorted due to the way the graph is constructed when reading in the edge list: vertices are added as
they are encountered, and vertices with many edges appear many times and thus have a higher likelihood of being
encountered early.
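
The greedy heuristic described above, without the sorting step, can be sketched in Python as follows. The function name and the use of a binary heap to find the least-loaded partition are implementation choices made for this illustration.

```python
import heapq

def greedy_partition(out_degrees, n_parts):
    """Assign each vertex, in input order, to the partition with the fewest edges.

    out_degrees : out_degrees[v] is the number of outgoing edges of vertex v
    Returns (assignment, loads), where assignment[v] is the partition of
    vertex v and loads[p] is the total edge count of partition p.
    """
    heap = [(0, part) for part in range(n_parts)]   # (edge count, partition ID)
    heapq.heapify(heap)
    assignment = []
    loads = [0] * n_parts
    for degree in out_degrees:
        load, part = heapq.heappop(heap)            # currently lowest edge count
        assignment.append(part)
        loads[part] = load + degree
        heapq.heappush(heap, (loads[part], part))
    return assignment, loads

# Hypothetical degree sequence, roughly sorted as described in the text.
assignment, loads = greedy_partition([5, 3, 2, 2, 1, 1], n_parts=2)
```

For this degree sequence both partitions end up with seven edges each, although they hold two and four vertices respectively.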
When distributing vertices within a single FPGA, there are no locality benefits: the communication costs are identical
between any two PEs, as even a message sent to another vertex on the same PE has to pass through the network so as
to respect the superstep synchronization. However, once we introduce multiple FPGAs, there is a significant difference
in available bandwidth for communicating with vertices located on remote FPGAs compared to local vertices.
We add the option to partition the graph between the FPGAs using METIS (Karypis and Kumar 1998). While the
runtime of this high-quality partitioning method is several orders of magnitude larger than that of our benchmarks, it
might still be worthwhile in specific circumstances (e.g. when the same dataset is reused many times), and it allows us
to explore some of the tradeoffs in partitioning.
As before, we use the outdegree of vertices as weight for balancing. METIS allows us to tune a balance factor ufactor,
which constrains the allowed imbalance of the resulting partition to be no greater than ufactor/1000. The default value
of ufactor is 1 (0.1% of imbalance allowed), and we observe that for RMAT graphs, the quality of the partition only
improves for values greater than 500. As the performance degradation from imbalance is much greater than the gain
from reducing inter-FPGA communication, we leave the value of ufactor at 1.
Dataset size limitations. So far, data is only partitioned in space, among the different PEs. We have not implemented
partitioning in time, where different parts of the graph are loaded for processing one after the other by the same PE.
The size of the datasets that GraVF-M can handle is consequently limited by the need to fit the entire graph in the
resources provided by the FPGAs available.
However, GraVF would be a good fit for grid-based partitioning, as is used e.g. in the software framework GridGraph
(Zhu et al. 2015) or in the other framework using multiple FPGAs, ForeGraph (Dai et al. 2017). In this approach,
the edge lists are partitioned both by origin and by destination vertex, so that the scatter stage can generate messages
separately for each block of vertices. This allows the gather stage to load the relevant vertices and process all messages
destined for this block, and then proceed to load the next block.
5 THEORETICAL PERFORMANCE ANALYSIS
We build a throughput-oriented performance model of the system based on an analysis of the various limiting factors.
This model serves a dual purpose, both to project expected performance of a system before proceeding to acquire it and
implement the algorithm, and for use by the framework during generation to adjust the system parameters to their
optimal values.
5.1 Objective
The goal of this performance model is to maximise the overall system throughput ($T_{\mathrm{sys}}$) by determining the values
of the following two variables: the number of FPGAs ($n_{\mathrm{FPGA}}$) and the number of PEs per FPGA ($n_{\mathrm{PE/FPGA}}$). System
throughput is measured in traversed edges per second (TEPS), a common measure of performance for graph processing.
5.2 Parameters
Symbol                            Description (unit)
Variables
  $n_{\mathrm{FPGA}}$             number of FPGAs to use
  $n_{\mathrm{PE/FPGA}}$          number of PEs per FPGA
Dependent Variables
  $T_{\mathrm{FPGA}}$             throughput per FPGA (edges/s)
  $T_{\mathrm{sys}}$              total system throughput (edges/s)
System Parameters
  $\mathrm{CPE}_{\mathrm{PE}}$    PE cycles per edge (edges$^{-1}$)
  $f_{\mathrm{clk}}$              operating frequency (Hz)
  $BW_{\mathrm{if}}$              network interface bandwidth (bits/s)
  $BW_{\mathrm{network}}$         total network bandwidth (bits/s)
  $BW_{\mathrm{mem}}$             memory interface bandwidth (bits/s)
  $M_{\mathrm{board}}$            memory capacity per FPGA (bits)
Algorithm-dependent Parameters
  $m_{\mathrm{vertex}}$           data storage per vertex (bits)
  $m_{\mathrm{message}}$          message size (bits/edge)
  $m_{\mathrm{update}}$           update size (bits)
  $m_{\mathrm{edge}}$             edge size (bits)
  $p_{\mathrm{msg/TE}}$           messages per traversed edge (edges$^{-1}$)
Dataset-dependent Parameters
  $|V|$                           number of vertices in the input graph
  $|E|$                           number of edges in the input graph
Table 1. Definitions
The domains for the two variables $n_{\mathrm{FPGA}}$ and $n_{\mathrm{PE/FPGA}}$ are determined by the platform configuration. When including
support for a given FPGA board in the framework, we specify the available resources as part of the board support
package. The domain for the number of PEs per FPGA ($n_{\mathrm{PE/FPGA}}$) is fixed based on the size of the reconfigurable fabric.
The computational demands of graph algorithms are generally low, so that the area demands for the user-provided
application kernels can be estimated generously. The upper limit is straightforwardly determined by reconfigurable
logic element usage per PE. For larger FPGAs, a lower limit greater than one PE may also be imposed to avoid routing
problems: as we strive to use all available BRAM to store vertex data, it becomes very hard for the synthesis tools to
route connections from all corners of the FPGA chip to a single PE.
The minimum number of FPGA boards is determined by the problem size: each board can host a number of vertices
limited by the available memory and the size of the vertex data specified by the algorithm. The number of FPGAs in the
system has to be big enough to host the whole dataset:
$$n_{\mathrm{FPGA}} \geq \frac{|V| \times m_{\mathrm{vertex}}}{M_{\mathrm{board}}}$$
Given these parameters, the system evaluates which of the following factors limits the overall performance:
5.3 Processing element throughput constraint.
Each processing element can traverse a limited number of edges per second. Analogous to a processor’s CPI (cycles
per instruction), we dene the processing element’s CPE (cycles per edge) as the average number of cycles it takes to
traverse an edge. It is determined by a combination of architectural features, input graph features and message paerns:
e PE pipeline is for the most part designed to handle a new edge each cycle, but the gather module can stall due to
read-aer-write hazards if two messages for the same vertex are received in quick succession, and the scaer module
needs a cycle to reset at the end of each edge list. e user kernel may also inadvertently lower CPE if an inecient
implementation is chosen, although this is not the case for our kernel implementations, which all support a CPE of 1.
Experimentally, we determine the CPE of our PEs to vary between 1.05 and 1.4.
e cumulative throughput limit LPE of all PEs in the system is expressed in the following formula:
Tsys ≤ LPE = nFPGA × nPE/FPGA × fclkCPEPE (1)
For the maximum number of PEs that ts in an FPGA, we obtain the computational limit of the FPGA, LPEmax .
5.4 Memory bandwidth constraint.
When using o-chip memory to store the adjacency lists, the FPGA throughput TFPGA may also be limited by the
memory bandwidth. Traversing one edge requires loading one edge of sizemedge from memory:
TFPGA ×medge ≤ BWmem
From which, by multiplying with the number of FPGAs in the system, we derive the memory interface limit Lmem :
Tsys ≤ Lmem = nFPGA × BWmem
medge
(2)
The above equation does not account for memory access granularity. While the HMC platform used in our experiments
allows accessing memory in comparatively small words of 128 bit, leading to very little wasted bandwidth, this can
become important e.g. when using AXI-based memory interfaces, which read data in words of 512 bit at a time. The
effect is also more pronounced for GraVF-M than for GraVF, as the edgelists are split into $n_{\mathrm{PE}}$ arrays, as discussed in
section 4.2.1 and shown in Fig. 4.
We can estimate the worst case wasted bandwidth if we assume that there is no correlation between the length of
the edgelist and the probability of it being accessed (i.e. vertices are equally likely to be active irrespective of their
degree) as follows.
There are $|V| \times n_{\mathrm{PE}}$ edgelists, containing $|E|$ edges in total. Each edgelist is stored as a sequential array of edges and
accessed as a whole. The start of the edgelist is aligned to the memory access word boundaries. In the worst case, the
last memory access for every edgelist contains only a single edge. Thus, to read all $|E|$ edges in the graph, on top of the
useful $|E| \times m_{\mathrm{edge}}$ bits of edgelist, an additional $|V| \times n_{\mathrm{PE}} \times (m_{\mathrm{memword}} - m_{\mathrm{edge}})$ bits are read from memory (where
$m_{\mathrm{memword}}$ denotes the number of bits read in one memory access). In per-traversed-edge terms, this corresponds to:
$$T_{\mathrm{FPGA}} \times \left( m_{\mathrm{edge}} + \frac{|V|}{|E|} \times n_{\mathrm{PE}} \times (m_{\mathrm{memword}} - m_{\mathrm{edge}}) \right) \leq BW_{\mathrm{mem}}$$
From this we can derive the refined limit accounting for access granularity:
$$T_{\mathrm{sys}} \leq n_{\mathrm{FPGA}} \times \frac{BW_{\mathrm{mem}}}{m_{\mathrm{edge}} + \frac{|V|}{|E|} \times n_{\mathrm{PE}} \times (m_{\mathrm{memword}} - m_{\mathrm{edge}})}$$
The additional term can grow to be a problem if the access granularity is very large, if the average degree of the
graph is low, or if the number of PEs in the system is very high. In the latter case there is however a limit: the edge list
cannot be spread thinner than one edge per PE (if a vertex has zero neighbors on a PE, no memory request is issued). If
the term $\frac{|V|}{|E|} \times n_{\mathrm{PE}}$ reaches 1, the formula reduces to $T_{\mathrm{sys}} \leq n_{\mathrm{FPGA}} \times \frac{BW_{\mathrm{mem}}}{m_{\mathrm{memword}}}$: a whole memory word has to be read
for each edge.
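
The memory limit, with and without the worst-case granularity term, can be written as a small Python function. The function and parameter names are hypothetical, and the numbers in the usage lines are arbitrary illustration values rather than measurements.

```python
def memory_limit(n_fpga, bw_mem, m_edge, m_memword=None,
                 n_vertices=0, n_edges=1, n_pe=1):
    """Memory interface limit in edges/s.

    Without m_memword this is equation (2). With m_memword, the worst-case
    granularity term is added; the number of edgelists per edge is capped
    at 1 because an edge list cannot be spread thinner than one edge per PE.
    """
    bits_per_edge = m_edge
    if m_memword is not None:
        lists_per_edge = min(1.0, n_vertices / n_edges * n_pe)
        bits_per_edge += lists_per_edge * (m_memword - m_edge)
    return n_fpga * bw_mem / bits_per_edge

# Arbitrary illustration values: 32-bit edges, 128-bit memory words.
ideal = memory_limit(1, 1024, 32)
refined = memory_limit(1, 1024, 32, m_memword=128,
                       n_vertices=10, n_edges=100, n_pe=2)
saturated = memory_limit(1, 1024, 32, m_memword=128,
                         n_vertices=100, n_edges=100, n_pe=4)
```

With an average degree of 10 and two PEs, each edge costs 51.2 bits instead of 32 in the worst case; once the cap is reached, every edge costs a full 128-bit word.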
5.5 Network interface bandwidth constraint.
The PEs have to exchange messages among each other.
With the GraVF-M optimization, the amount of data transferred is reduced. Instead of one message per neighbor,
one update is sent to each FPGA for all neighbors combined. On average, a vertex has $|E|/|V|$ neighbors. We assume that
the network does not support a true broadcast operation, so each update has to be sent individually to all $n_{\mathrm{FPGA}} - 1$
remote FPGAs. We also assume that the sending and receiving bandwidths are symmetrical, i.e. both equal to half the
reported total interface bandwidth, $BW_{\mathrm{if}}/2$.
The average amount of data sent over the network per edge is $m_{\mathrm{update}} \times (n_{\mathrm{FPGA}} - 1) \times |V|/|E|$. $T_{\mathrm{FPGA}}$ is the number of
edges processed per second on the sending FPGA. The limitation that the network interface of a given FPGA imposes
on the performance is thus expressed by the inequality
$$T_{\mathrm{FPGA}} \times m_{\mathrm{update}} \times (n_{\mathrm{FPGA}} - 1) \times \frac{|V|}{|E|} \leq \frac{BW_{\mathrm{if}}}{2}$$
Combining this with $T_{\mathrm{sys}} = T_{\mathrm{FPGA}} \times n_{\mathrm{FPGA}}$ and solving for $T_{\mathrm{sys}}$ results in:
$$T_{\mathrm{sys}} \leq L_{\mathrm{if}} = \frac{BW_{\mathrm{if}}}{2 \times m_{\mathrm{update}}} \times \frac{n_{\mathrm{FPGA}}}{n_{\mathrm{FPGA}} - 1} \times \frac{|E|}{|V|} \qquad (3)$$
For GraVF, which sends unicast messages rather than broadcasting updates, the network interface limit has been
calculated in (Engelhardt et al. 2018) to be
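$$L_{\mathrm{if}}(\text{GraVF}) = \frac{BW_{\mathrm{if}}}{2 \times m_{\mathrm{message}}} \times \frac{n_{\mathrm{FPGA}}^{2}}{n_{\mathrm{FPGA}} - 1} \qquad (4)$$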
Thus, the theoretical speedup of GraVF-M over GraVF is:
$$\text{Speedup} = \frac{L_{\mathrm{if}}(\text{GraVF-M})}{L_{\mathrm{if}}(\text{GraVF})} = \frac{|E|}{|V|} \times \frac{1}{n_{\mathrm{FPGA}}} \times \frac{m_{\mathrm{message}}}{m_{\mathrm{update}}} \qquad (5)$$
Typically, $|E|$ is larger than $|V|$ by one to two orders of magnitude. However, for very large systems or extremely
sparse graphs where the term $|E|/(|V| \times n_{\mathrm{FPGA}})$ is smaller than 1, the optional filtering step described in section 4.3
that sends updates only to FPGAs where neighbors reside ensures that the performance will not be reduced below that
of GraVF. With our current implementation, $m_{\mathrm{update}} = m_{\mathrm{message}}$, so the ratio of the two sizes is 1 and does not affect performance.
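
The two network interface limits and their ratio can be checked numerically with a short Python sketch. The function names are hypothetical and the input values are arbitrary; the functions implement equations (3) and (4) as stated above, and the speedup is computed directly as their ratio.

```python
def if_limit_gravf_m(bw_if, m_update, n_fpga, avg_degree):
    """Equation (3): interface limit with broadcast updates (avg_degree = |E|/|V|)."""
    return bw_if / (2 * m_update) * n_fpga / (n_fpga - 1) * avg_degree

def if_limit_gravf(bw_if, m_message, n_fpga):
    """Equation (4): interface limit with unicast messages."""
    return bw_if / (2 * m_message) * n_fpga ** 2 / (n_fpga - 1)

# Arbitrary illustration values: equal update and message sizes,
# 4 FPGAs, average degree 16.
limit_m = if_limit_gravf_m(bw_if=256, m_update=32, n_fpga=4, avg_degree=16)
limit_0 = if_limit_gravf(bw_if=256, m_message=32, n_fpga=4)
speedup = limit_m / limit_0
```

With equal update and message sizes the ratio is the average degree divided by the number of FPGAs, here 16 / 4 = 4.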
5.6 Total network bandwidth constraint.
The overall bandwidth of the network connecting the FPGAs is also limited. The cumulative data sent by all the FPGA
boards cannot exceed it. As before, the average amount of data sent over the network per edge is $m_{\mathrm{update}} \times (n_{\mathrm{FPGA}} - 1) \times |V|/|E|$.
The total number of edges processed per second in the whole system is $T_{\mathrm{sys}}$.
The limitation introduced by the total network bandwidth can thus be expressed in the inequality
$$T_{\mathrm{sys}} \times m_{\mathrm{update}} \times (n_{\mathrm{FPGA}} - 1) \times \frac{|V|}{|E|} \leq BW_{\mathrm{network}}$$
Rearranging this equation gives the network limit $L_{\mathrm{net}}$:
$$T_{\mathrm{sys}} \leq L_{\mathrm{net}} = \frac{BW_{\mathrm{network}}}{(n_{\mathrm{FPGA}} - 1) \times m_{\mathrm{update}}} \times \frac{|E|}{|V|} \qquad (6)$$
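Comparing this to the limits of GraVF found in (Engelhardt et al. 2018):
$$T_{\mathrm{sys}} \leq L_{\mathrm{network}} = \frac{BW_{\mathrm{network}} \times n_{\mathrm{FPGA}}}{(n_{\mathrm{FPGA}} - 1) \times m_{\mathrm{message}}} \qquad (7)$$
As the same reduction in amount of data crossing the network applies, we once again obtain the same speedup as for
the network interface limit.
$$\text{Speedup} = \frac{L_{\mathrm{net}}}{L_{\mathrm{network}}} = \frac{|E|}{|V|} \times \frac{1}{n_{\mathrm{FPGA}}} \times \frac{m_{\mathrm{message}}}{m_{\mathrm{update}}} \qquad (8)$$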
Other network topologies. In these two network limits, only a simple network consisting of direct one-to-one
communication between FPGAs is considered, as this corresponds to the platforms on which GraVF(-M) has been
implemented so far. However, other topologies exist.
The reasoning on network and network interface constraints can be extended to hierarchical networks with more
layers. This is commonly the case where e.g. multiple FPGAs are hosted on the same node and can communicate quickly
with each other, but there is less bandwidth available to communicate with FPGAs on remote nodes. In this situation,
the same equations apply, with the number of nodes in place of the number of FPGAs.
More generally, the techniques of network analysis can be applied by considering each FPGA as a source of
$T_{\mathrm{FPGA}} \times m_{\mathrm{update}} \times |V|/|E|$ bits/s of data, which have to be broadcast to all $n_{\mathrm{FPGA}} - 1$ other FPGAs, and modelling whatever
interconnect topology the system in question is equipped with.
5.7 Determining the limiting factor
Overall, taking into account all four factors, we can derive the predicted throughput of the system:
$$T_{\mathrm{sys}} = \min(L_{\mathrm{PE}}, L_{\mathrm{mem}}, L_{\mathrm{if}}, L_{\mathrm{net}}) \qquad (9)$$
Since we would not want to computationally limit the system while there are resources available, in a first step we
assume we use the maximum number of PEs available, fixing the value of $n_{\mathrm{PE/FPGA}}$ in $L_{\mathrm{PE}}$ and calling the resulting
value $L_{\mathrm{PEmax}}$.
$L_{\mathrm{PEmax}}$ and $L_{\mathrm{mem}}$ increase with $n_{\mathrm{FPGA}}$, whereas $L_{\mathrm{if}}$ and $L_{\mathrm{net}}$ decrease with $n_{\mathrm{FPGA}}$. Therefore, there are multiple
candidates for the optimum value of $T_{\mathrm{sys}}$. As the network limits are not defined for $n_{\mathrm{FPGA}} = 1$, it might be the best
solution. Otherwise, it will be obtained at the point where the tighter of the network limits and the tighter of the other limits intersect:
$$\min(L_{\mathrm{PEmax}}, L_{\mathrm{mem}}) = \min(L_{\mathrm{if}}, L_{\mathrm{net}})$$
This equality is solved for $n_{\mathrm{FPGA}}$, and the resulting value of $T_{\mathrm{sys}}$ compared to the single-FPGA performance to find
out which is the real solution. Rounding to the nearest integer if necessary, we obtain the value of $n_{\mathrm{FPGA}}$.
If memory or network is found to be the limiting factor, once the number of FPGAs is chosen, the number of PEs
per FPGA can be lowered to save power. Substituting the values of $T_{\mathrm{sys}}$ and $n_{\mathrm{FPGA}}$ calculated in the previous step into
equation 1 gives a lower bound on the necessary number of PEs per FPGA to maintain the same performance:
$$n_{\mathrm{PE/FPGA}} \geq \frac{T_{\mathrm{sys}} \times \mathrm{CPE}_{\mathrm{PE}}}{n_{\mathrm{FPGA}} \times f_{\mathrm{clk}}}$$
In this manner, the framework chooses the optimal values for the number of FPGAs ($n_{\mathrm{FPGA}}$) and the number of PEs
per FPGA ($n_{\mathrm{PE/FPGA}}$), optimizing first for throughput and then for power.
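
The selection procedure can be approximated by the following Python sketch. Note the substitution: instead of solving the intersection analytically and rounding, it simply evaluates equation (9) for every admissible integer number of FPGAs, which is a different but straightforward way to reach an integer optimum. The function name, the dictionary keys and all parameter values are hypothetical.

```python
import math

def best_configuration(p, max_fpga):
    """Scan FPGA counts and return (t_sys, n_fpga, n_pe_per_fpga).

    p is a dict of model parameters (keys are names assumed for this sketch).
    """
    avg_degree = p["n_edges"] / p["n_vertices"]
    # Smallest system that can hold all vertex data.
    n_min = max(1, math.ceil(p["n_vertices"] * p["m_vertex"] / p["m_board"]))
    best = None
    for n in range(n_min, max_fpga + 1):
        limits = [
            n * p["n_pe_max"] * p["f_clk"] / p["cpe"],   # L_PEmax, equation (1)
            n * p["bw_mem"] / p["m_edge"],               # L_mem, equation (2)
        ]
        if n > 1:   # the network limits are not defined for a single FPGA
            limits.append(p["bw_if"] / (2 * p["m_update"])
                          * n / (n - 1) * avg_degree)    # L_if, equation (3)
            limits.append(p["bw_network"] / ((n - 1) * p["m_update"])
                          * avg_degree)                  # L_net, equation (6)
        t_sys = min(limits)                              # equation (9)
        if best is None or t_sys > best[0]:
            best = (t_sys, n)
    t_sys, n = best
    # Fewest PEs per FPGA that still sustain t_sys (lower bound from equation 1).
    n_pe = math.ceil(t_sys * p["cpe"] / (n * p["f_clk"]))
    return t_sys, n, n_pe

# Arbitrary illustration values, not measurements.
params = {
    "n_vertices": 1000, "n_edges": 16000, "m_vertex": 32, "m_board": 1e9,
    "n_pe_max": 8, "f_clk": 100, "cpe": 1, "bw_mem": 1e9, "m_edge": 32,
    "bw_if": 6400, "m_update": 32, "bw_network": 1e12,
}
result = best_configuration(params, max_fpga=4)
```

With these values the PE limit grows as 800 edges/s per FPGA while the interface limit shrinks, and the two meet at three FPGAs; a fourth FPGA would lower the predicted throughput.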
6 PERFORMANCE EVALUATION
6.1 Evaluation platform
To evaluate the performance of the framework-generated designs, we use a system comprised of four Micron AC-510
modules on an EX-750 backplane. Each AC-510 module contains a Xilinx KU060 FPGA and a 4GB Hybrid Memory Cube
connected with two half-width (x8) links with 15Gb/s signaling rate. The EX-750 backplane allows direct communication
via PCIe x8 gen 3 link between the FPGAs without passing through the host.
To apply our analytical model described in section 5, we need to determine the relevant parameters of the platform:
the operating frequency $f_{\mathrm{clk}}$, the PE throughput measure $\mathrm{CPE}_{\mathrm{PE}}$, the memory interface bandwidth $BW_{\mathrm{mem}}$, the network
interface bandwidth $BW_{\mathrm{if}}$, and the total network bandwidth $BW_{\mathrm{network}}$. Most of these factors are influenced by the
implementation of the memory controller and PCIe endpoint provided by Micron, so we experimentally determine
these values.
The operating frequency $f_{\mathrm{clk}}$ is determined by using the maximum supported by the memory controller interface,
187.5 MHz.
We derive the PE throughput measure $\mathrm{CPE}_{\mathrm{PE}}$ from the performance of a system with only one single PE using
on-chip memory. As the network reduces to a single FIFO feeding the PE’s output back into its input, no outside
factors should impede performance in this situation. We measure $\mathrm{CPE}_{\mathrm{PE}} = 1.05$ for WCC, $\mathrm{CPE}_{\mathrm{PE}} = 1.10$ for BFS, and
$\mathrm{CPE}_{\mathrm{PE}} = 1.42$ for PR.
For the memory interface bandwidth, we use the GUPS benchmark included with the PicoFramework development
environment. The peak experimentally determined available memory bandwidth is 21.7GiB/s, although for the configuration
most closely matching our use case the observed bandwidth is only 8.1GiB/s (as memory bandwidth was never
the limiting factor in our experiments, we did not invest further effort in optimizing this).
Determining the network parameters is more complicated. Communication over PCIe is wrapped by the PicoFramework
into streams, which are presented to the user like ordinary 128-bit wide first-word fall-through FIFOs with one
end on either FPGA. A separate stream has to be opened for communication between every pair of FPGAs in the system.
The PCIe interface bandwidth $BW_{\mathrm{if}}$ is divided among the streams, but the streams themselves also have an inherent
maximum throughput of 128 bit every clock cycle. Either of these can be the limiting factor.
We implement a benchmark that continuously sends data to all other FPGAs in the system, and measure the
cumulative sending and receiving bandwidth from all channels on one FPGA when sending data to 1, 2 or 3 other
FPGAs. (The other FPGAs are also sending data to each other, in the same configuration as is used in GraVF/GraVF-M.)
To mirror our use case, the always-on sender and receiver are located in the 187.5MHz clock domain, and connected to
the stream clock domain (200MHz) through an asynchronous FIFO. While one would think that this would result in less
data transmitted than if the sender is directly in the stream clock domain, it does in fact lead to 20% higher bandwidth
transmissions in the 4-FPGA case, presumably due to congestion effects. The results are shown in Table 2. We find that
the network interface bandwidth is $BW_{\mathrm{if}} = 11.7$ GiB/s, but that performance is further restricted in the 2-FPGA and
3-FPGA case by the FIFO’s interface bandwidth.
$n_{\mathrm{FPGA}}$    send (GiB/s)    receive (GiB/s)
2                      2.79            2.79
3                      5.59            5.59
4                      5.85            5.85
Table 2. Observed network bandwidth
As there is no slowdown at the maximum configuration, we are unable to determine $BW_{\mathrm{network}}$ precisely, but we can
determine that it is larger than the amount of data sent by all 4 FPGAs simultaneously, 23.4 GiB/s, and that it is never
the limiting factor on this platform.
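
The figures in Table 2 can be cross-checked with a few lines of Python arithmetic. This is one possible reading rather than a statement from the measurements: it assumes that in the 2-FPGA and 3-FPGA cases each stream is bounded by a sender that delivers one 128-bit word per cycle of the 187.5 MHz clock domain. The function name is hypothetical.

```python
GIB = 2 ** 30   # bytes per GiB

def stream_limit_gib(width_bits, clock_hz, n_streams):
    """Peak throughput in GiB/s of n_streams FIFOs moving one word per cycle."""
    return n_streams * (width_bits / 8) * clock_hz / GIB

# Assumed bound: one 128-bit word per 187.5 MHz cycle and per stream.
one_stream = stream_limit_gib(128, 187.5e6, 1)    # 2 FPGAs: one remote peer
two_streams = stream_limit_gib(128, 187.5e6, 2)   # 3 FPGAs: two remote peers

# Aggregate traffic when all four FPGAs send at the measured 5.85 GiB/s,
# and the interface bandwidth as the sum of send and receive rates.
total_sent = 4 * 5.85
interface_bw = 2 * 5.85
```

Under this assumption the per-stream bound reproduces the 2.79 GiB/s and 5.59 GiB/s rows, while the 4-FPGA row falls below three times the per-stream bound, consistent with the interface bandwidth of 11.7 GiB/s (5.85 GiB/s in each direction) becoming the limit.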
6.2 Algorithms and datasets
We implement kernels for three algorithms for use with our framework: Breadth-First Search, Weakly Connected
Components, and PageRank. These algorithms are widely used by previous work and thus allow easy performance
comparison. They also cover distant points on the spectrum of workloads, from the highly variable (BFS) to the regular
(PR).
Breadth-First Search (BFS). BFS is a kernel that forms the base for a wide variety of applications, including the
graph500 supercomputing ranking (Murphy et al. 2010). The most common variant, which we adopt, is one that visits
all vertices in the graph in order to compute a minimum spanning tree, rather than a true search (which might be
abandoned early upon finding a specific vertex) as the name suggests. Each vertex stores the ID of its parent vertex in
the spanning tree.
In BFS, each vertex is active only once, during the superstep where it is first visited. Exactly one message is sent for
each edge in the graph. The number of active vertices is highly variable over the course of execution, starting from a
single active vertex in the first superstep, growing to encompass a large portion of the graph within a few hops, and
finally reducing again to only a few remote vertices as it nears completion.
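
The activity pattern of BFS can be reproduced with a plain level-synchronous software version in Python. This sketch is a sequential reference, not the GraVF-M kernel; the function name and the example graph are hypothetical.

```python
def bfs_supersteps(out_neighbors, root):
    """Level-synchronous BFS.

    out_neighbors : dict mapping each vertex to its list of outgoing neighbors
    Returns (parent, active_per_step, messages): the parent of each reached
    vertex, the number of active vertices in each superstep, and the total
    number of messages sent.
    """
    parent = {root: root}
    frontier = [root]
    active_per_step = []
    messages = 0
    while frontier:
        active_per_step.append(len(frontier))
        next_frontier = []
        for v in frontier:
            for dst in out_neighbors[v]:
                messages += 1             # one message per traversed edge
                if dst not in parent:     # first visit: record the parent
                    parent[dst] = v
                    next_frontier.append(dst)
        frontier = next_frontier
    return parent, active_per_step, messages

# Hypothetical diamond-shaped graph with four edges.
diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}
parent, active_per_step, messages = bfs_supersteps(diamond, root=0)
```

Each vertex appears in exactly one frontier, and the message count equals the number of edges reachable from the root, which is four in this example.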
Weakly Connected Components (WCC). The WCC algorithm detects unconnected portions of a graph. Each connected
set of vertices agrees on a common label by propagating the lowest seen vertex ID to all neighbors. (This algorithm is
described in detail in section 3, where it is used as an example illustrating the kernel syntax accepted by GraVF-M.)
Many vertices will be active for more than one superstep, especially those with high vertex IDs, as progressively lower
vertex IDs are propagated. However, similarly to BFS, the total amount of activity is limited by the diameter of the graph
and wanes towards the end of the computation.
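The label propagation described above can be sketched in a few lines of Python. This is a software illustration under stated assumptions, not the kernel syntax accepted by GraVF-M: the name `wcc_supersteps` is hypothetical, every vertex starts with its own ID as label, and edges are treated as undirected.

```python
def wcc_supersteps(num_vertices, edges):
    """Label propagation for weakly connected components.

    Returns (label, activations, supersteps), where activations[v] counts
    the supersteps in which vertex v was active.
    """
    neighbors = [[] for _ in range(num_vertices)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)     # weak connectivity ignores direction

    label = list(range(num_vertices))
    active = set(range(num_vertices))
    activations = [0] * num_vertices
    supersteps = 0

    while active:
        supersteps += 1
        inbox = {}
        # Scatter: active vertices send their current label to all neighbors;
        # only the lowest label received by each vertex needs to be kept.
        for u in active:
            activations[u] += 1
            for v in neighbors[u]:
                inbox[v] = min(inbox.get(v, label[u]), label[u])
        # Apply: a vertex that sees a lower label adopts it and stays active.
        active = set()
        for v, lowest in inbox.items():
            if lowest < label[v]:
                label[v] = lowest
                active.add(v)
    return label, activations, supersteps
```

On a path 0-1-2-3, vertex 3 is active in four supersteps while vertex 0 is active only once, which matches the observation that vertices with high IDs tend to be reactivated as progressively lower IDs arrive.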
PageRank (PR). PageRank is a ranking algorithm that assigns a floating-point score to each vertex in the graph
based on the score of its neighbors, computed over multiple iterations towards a fixed point. We have adapted the
vertex-centric implementation from Pregel (Malewicz et al. 2010).
During each superstep, each vertex contributes 1/n of its score from the previous iteration to each of its neighbors’
scores, where n is the number of (outgoing) neighbors that the vertex is connected to. Computation is run for 30
supersteps. All vertices are always active, and a message is sent over each edge of the graph in each superstep. PR is
thus a heavier but more regular computation than the other two algorithms. The amount of computation per vertex per
iteration is also larger, consisting of multiple floating-point operations which combine to give the pipelined kernel a
latency of several hundred cycles.
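A minimal Python sketch of this computation follows. It is not the pipelined hardware kernel: the function name `pagerank` is hypothetical, and the damping factor of 0.85 and the uniform initial score are assumptions borrowed from the usual Pregel-style formulation, as the text above does not state them.

```python
def pagerank(num_vertices, edges, supersteps=30, damping=0.85):
    """Vertex-centric PageRank over a directed edge list.

    Returns (score, messages_sent). Vertices without outgoing edges send
    nothing, so their score is not redistributed in this simple sketch.
    """
    out_edges = [[] for _ in range(num_vertices)]
    for u, v in edges:
        out_edges[u].append(v)

    score = [1.0 / num_vertices] * num_vertices
    messages_sent = 0

    for _ in range(supersteps):
        incoming = [0.0] * num_vertices
        # Scatter: each vertex sends 1/n of its score along each of its
        # n outgoing edges.
        for u in range(num_vertices):
            if out_edges[u]:
                share = score[u] / len(out_edges[u])
                for v in out_edges[u]:
                    incoming[v] += share
                    messages_sent += 1
        # Apply: combine the received contributions into the new score.
        score = [(1.0 - damping) / num_vertices + damping * s for s in incoming]
    return score, messages_sent
```

Because all vertices are active in every superstep, the message count is simply the number of edges multiplied by the number of supersteps.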
Datasets. We use two types of synthetic graphs to evaluate our framework. Uniform graphs have edges distributed
evenly, with most vertices close to the average degree. Edges are generated with equal probability for any pair of
vertices. Scale-free graphs have extremely skewed degree distributions following a power law. Most vertices have low
degrees, but a few are highly connected. These datasets are obtained from the RMAT generator (Chakrabarti et al. 2004).
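The difference between the two graph types can be illustrated with a short generator sketch in Python. The function names and the `scale`/`edgefactor` parameterization are hypothetical, and the R-MAT quadrant probabilities (a = 0.57, b = 0.19, c = 0.19) are an assumption taken from common Graph500 practice; the parameters actually used for the experiments are not stated here. Self-loops and duplicate edges are not filtered.

```python
import random


def uniform_edges(scale, edgefactor, seed=0):
    """Edges with both endpoints drawn uniformly from 2**scale vertices."""
    rng = random.Random(seed)
    n = 1 << scale
    return [(rng.randrange(n), rng.randrange(n)) for _ in range(n * edgefactor)]


def rmat_edges(scale, edgefactor, a=0.57, b=0.19, c=0.19, seed=0):
    """R-MAT edges: each bit of (src, dst) is chosen by recursively picking
    one quadrant of the adjacency matrix with probabilities a, b, c, d."""
    rng = random.Random(seed)
    n = 1 << scale
    edges = []
    for _ in range(n * edgefactor):
        src = dst = 0
        for _ in range(scale):
            r = rng.random()
            src_bit = 1 if r >= a + b else 0                      # quadrants c, d
            dst_bit = 1 if (a <= r < a + b) or r >= a + b + c else 0  # quadrants b, d
            src = (src << 1) | src_bit
            dst = (dst << 1) | dst_bit
        edges.append((src, dst))
    return edges


def max_out_degree(num_vertices, edges):
    """Largest number of outgoing edges attached to any single vertex."""
    degree = [0] * num_vertices
    for u, _ in edges:
        degree[u] += 1
    return max(degree)
```

With the same vertex and edge counts, the maximum degree of the R-MAT graph is far above that of the uniform graph, which is the skew that makes scale-free graphs harder to partition evenly.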
Measurement. We evaluate performance using a common throughput measure for graph algorithms, traversed edges
per second (TEPS). It is not well-defined in the literature what counts as traversing an edge, especially for algorithms
like WCC where vertices can receive new data from the same neighbor multiple times during the course of execution.
Here, we consider that an edge is traversed every time information is transferred along it, i.e. when the scatter kernel
sends a message. The overall system throughput Tsys is calculated as the total number of messages sent over the course
of the execution divided by the runtime.
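Expressed as code, the measure is a single division. The helper names below are hypothetical; the second function only restates the earlier observation that PR sends one message per edge in each of its 30 supersteps.

```python
def system_throughput(messages_sent, runtime_seconds):
    """Tsys in traversed edges per second: messages sent / runtime."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return messages_sent / runtime_seconds


def pagerank_messages(num_edges, supersteps=30):
    """PR sends one message over every edge in every superstep."""
    return num_edges * supersteps
```

Dividing the result by 10**6 or 10**9 gives the MTEPS and GTEPS figures quoted in the experiments.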
6.3 Experiments
In these experiments, we aim to evaluate both the general performance of the framework-generated designs, and to
compare the predictions of the performance model to the actual results.
6.3.1 Multi-FPGA Scaling. To measure the effect of the improved architecture introduced in this paper, we compare
the performance of GraVF-M to GraVF when scaling from one to four FPGA boards. Each FPGA is configured to use 9
PEs, with the use of off-chip HMC memory for edge data enabled. We use a uniform graph dataset, and partition it
among the FPGAs using the greedy edge-based heuristic. Results are shown in Fig. 7.
[Figure 7: three panels (PageRank, BFS, WCC), each plotting system throughput (MTEPS) against the number of FPGAs (1 to 4).
Series: GraVF-M, GraVF, computation limit, network interface limit for GraVF.]
Fig. 7. Throughput for a multi-FPGA system
Overall, GraVF-M exhibits much better scaling performance. Although GraVF is still preferable when run on a single
FPGA, GraVF-M achieves a 2.2–2.8× speedup when spreading computation across 4 FPGAs.
In the following sections we will examine some aspects of GraVF-M performance in more detail.
6.3.2 Single-FPGA Performance. We evaluate the PE performance when unconstrained by factors such as memory
or inter-FPGA network. In this experiment, we only use a single FPGA, and we compare the GraVF and GraVF-M
designs. Fig. 8 shows the performance of both designs for a uniform graph dataset.
In exchange for better multi-FPGA performance, the single-chip performance of GraVF-M is reduced compared to
GraVF when using BRAM. The GraVF-M on-chip network is more prone to communication conflicts, as the broadcast
mechanism results in the PEs sending nearly in lockstep. A single PE unable to accept more data also blocks all PEs
from sending, whereas in GraVF PEs would still be able to send messages to other destinations, and only stop when
chancing upon a message destined for the blocking PE. A large part of this performance can be regained when using
the off-chip HMC memory, as the 64 in-flight requests per PE in effect provide extra buffer space.
[Figure 8: three panels (PageRank, BFS, WCC), each plotting system throughput (MTEPS) against the number of PEs (1, 2, 4, 8).
Series: GraVF-M with HMC, GraVF-M with BRAM, GraVF with HMC, GraVF with BRAM, computation limit.]
Fig. 8. GraVF-M vs. GraVF performance on uniform graph.
6.3.3 Effects of average degree. The core GraVF-M architecture improvement introduces a dataset-dependent term
|E|/|V| when the performance is limited by the inter-FPGA network. Consequently, GraVF-M performance should be
better for graphs with higher average degree.
[Figure 9: system throughput (MTEPS) against average degree (2 to 64) for BFS, WCC and PR.]
Fig. 9. Effect of average degree on GraVF-M performance.
We generate uniform graphs with a constant number of vertices and average degrees ranging from 2 to 64. Fig. 9
shows the performance of a 4-FPGA GraVF-M design using off-chip HMC memory with 9 PEs per FPGA and with
greedy-edge partitioning on these graphs. The influence of the average degree (|E|/|V|) on performance can be clearly
seen. At very low degree, the performance (especially of the BFS algorithm) is more variable as other effects dominate
(BFS is sensitive to the shape of the graph). At the highest degree shown (64), the computational limit of the PEs is
reached before the network limit, especially for PR which has a higher CPE.
This reliance on the average degree does mean that GraVF-M does not benefit low-degree datasets such as road
networks as much. On a subgraph of the Pennsylvania road network obtained from the SNAP database (Leskovec and
Krevl 2014), which has an average degree of 2.8, we achieve a performance of 182 MTEPS for BFS, 720 MTEPS for PR
and 663 MTEPS for WCC using 4 FPGAs.
6.3.4 Latency effects. To estimate the efficiency of the barrier synchronization, we create a set of synthetic graphs
with the structure depicted in Fig. 10, which allows us to precisely control the number of supersteps (d + 1) and the
number of vertices active in each superstep (w) for the BFS algorithm (the most latency-sensitive of our algorithms).
The solid edges represent the BFS minimum spanning tree, and the additional edges (drawn dashed) between vertices of
the same rank serve to achieve a higher average degree without affecting when a vertex will be active.
[Figure 10: diagram of the synthetic graph structure, annotated with the dimensions w and d.]
Fig. 10. Structure of the synthetic graphs used in this experiment. The single vertex on the left is selected as root for the BFS
algorithm.
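One possible construction of such a graph is sketched below in Python. The exact wiring of the figure is not recoverable from the text, so the layout here is an assumption: the root connects to all w vertices of the first rank, each vertex connects to the vertex at the same position in the next rank, and the extra edges connect vertices within a rank. The names `layered_graph` and `bfs_active_counts` are hypothetical.

```python
def layered_graph(w, d, extra_per_vertex=0):
    """Root vertex 0 followed by d ranks of w vertices each.

    Returns (num_vertices, edges) with directed edges. Extra edges stay
    inside a rank, so they never change when a vertex is first visited.
    """
    if extra_per_vertex >= w and extra_per_vertex > 0:
        raise ValueError("extra_per_vertex must be smaller than w")

    def vid(rank, j):              # rank in 1..d, position j in 0..w-1
        return 1 + (rank - 1) * w + j

    edges = [(0, vid(1, j)) for j in range(w)]          # root to first rank
    for rank in range(1, d):                            # tree edges
        for j in range(w):
            edges.append((vid(rank, j), vid(rank + 1, j)))
    for rank in range(1, d + 1):                        # same-rank edges
        for j in range(w):
            for k in range(1, extra_per_vertex + 1):
                edges.append((vid(rank, j), vid(rank, (j + k) % w)))
    return 1 + w * d, edges


def bfs_active_counts(num_vertices, edges, root=0):
    """Number of newly visited (active) vertices in each BFS superstep."""
    out_edges = [[] for _ in range(num_vertices)]
    for u, v in edges:
        out_edges[u].append(v)
    visited = [False] * num_vertices
    visited[root] = True
    frontier, counts = [root], []
    while frontier:
        counts.append(len(frontier))
        next_frontier = []
        for u in frontier:
            for v in out_edges[u]:
                if not visited[v]:
                    visited[v] = True
                    next_frontier.append(v)
        frontier = next_frontier
    return counts
```

With this layout, BFS from the root runs for d + 1 supersteps, with one active vertex in the first superstep and w in each of the following ones, regardless of how many same-rank edges are added.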
In a first step, we run BFS on a line graph with 16385 vertices (i.e. the graph obtained for w = 1, d = 16384 and with
no additional edges). This graph will have only a single vertex active each superstep, for 16385 consecutive supersteps.
As the amount of work per superstep is negligible, the time used per superstep is entirely attributable to synchronization
latency. From this experiment we obtain a latency of 676 cycles/superstep, or 3.6 µs, for a system of 4 FPGAs with 9 PEs
each, using HMC (i.e. the same system as in Fig. 7). This can be compared to the round-trip time between two FPGAs in
this system, which is 1.6 µs.
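The figures in this paragraph can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted above; the derived quantities (implied clock rate, barrier cost in round trips, total runtime of the line-graph experiment) are back-of-the-envelope estimates, and since 3.6 µs is a rounded value they are approximate.

```python
CYCLES_PER_SUPERSTEP = 676        # measured synchronization latency
SECONDS_PER_SUPERSTEP = 3.6e-6    # the same latency in seconds (rounded)
ROUND_TRIP_SECONDS = 1.6e-6       # round-trip time between two FPGAs
SUPERSTEPS = 16385                # supersteps on the line graph

# Clock rate implied by 676 cycles taking 3.6 microseconds.
implied_clock_hz = CYCLES_PER_SUPERSTEP / SECONDS_PER_SUPERSTEP

# Cost of one barrier expressed in inter-FPGA round trips.
round_trips_per_barrier = SECONDS_PER_SUPERSTEP / ROUND_TRIP_SECONDS

# Estimated total runtime of the line-graph experiment.
line_graph_runtime_s = SUPERSTEPS * SECONDS_PER_SUPERSTEP

print(f"implied clock: {implied_clock_hz / 1e6:.1f} MHz")
print(f"barrier cost: {round_trips_per_barrier:.2f} round trips")
print(f"line graph runtime: {line_graph_runtime_s * 1e3:.1f} ms")
```

One barrier therefore costs a little over two network round trips, and the line-graph run is estimated to take roughly 59 ms in total.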
[Figure 11: BFS throughput (MTEPS) against the number of active vertices per superstep (w), with the throughput-limited and
latency-limited regions marked.]
Fig. 11. Throughput obtained for synthetic graphs as in Fig. 10 when increasing d and decreasing w while keeping the total vertex
and edge count constant.
In a second step, we create a series of these synthetic graphs with a constant number of vertices, but distributed
across an increasing number of supersteps. The average degree for the graphs is set to 64, and the number of vertices
active in each superstep w is decreased from 16384 to 64, while d is increased from 1 to 256 supersteps. Fig. 11 shows
the throughput on this series of graphs, for the same system of 4 FPGAs with 9 PEs each, using HMC. It can be seen
that latency becomes a significant factor starting at 2k active vertices per superstep, which corresponds to 128k edges
traversed per superstep. As the number of supersteps increases and the amount of work per superstep decreases, the
time per superstep converges towards the synchronization latency previously determined.
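The transition from the throughput-limited to the latency-limited regime can be illustrated with a simple additive model, sketched below in Python. This is an assumption made for illustration and not the performance model of this paper: each superstep is taken to cost the 3.6 µs synchronization latency plus the time needed to traverse its edges at a fixed peak rate. The peak rate passed in is a free parameter, and the function names are hypothetical.

```python
SYNC_LATENCY_S = 3.6e-6   # synchronization latency per superstep (from the text)


def superstep_time(w, avg_degree, peak_teps, sync_latency_s=SYNC_LATENCY_S):
    """Assumed cost of one superstep: edge work plus barrier latency."""
    edges_per_superstep = w * avg_degree
    return edges_per_superstep / peak_teps + sync_latency_s


def modeled_throughput(w, avg_degree, peak_teps, sync_latency_s=SYNC_LATENCY_S):
    """Edges traversed per second under the additive model."""
    edges_per_superstep = w * avg_degree
    return edges_per_superstep / superstep_time(
        w, avg_degree, peak_teps, sync_latency_s)


if __name__ == "__main__":
    # 5e9 TEPS is a hypothetical peak rate chosen only for this example.
    for w in (16384, 2048, 256, 64):
        teps = modeled_throughput(w, 64, 5e9)
        print(f"w={w:6d}  edges/superstep={w * 64:8d}  {teps / 1e6:7.0f} MTEPS")
```

Under this model, 2k active vertices at an average degree of 64 correspond to 128k edges per superstep, and as w shrinks the time per superstep approaches the fixed synchronization latency.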
6.3.5 Partitioning. In this experiment we examine the effectiveness of the partitioning methods. Partitioning is
more challenging for scale-free graphs, so we use the RMAT-generated graphs. The system is configured to use HMC
memory.
[Figure 12: three panels (PageRank, BFS, WCC), each plotting system throughput (MTEPS) against the number of PEs (1, 2, 4, 8).
Series: Roundrobin, Greedy Edge.]
Fig. 12. Round-robin vs. greedy edge-based partitioning.
Fig. 12 shows the comparison of the round-robin and greedy partitioning strategies on single-FPGA systems
of 1 to 8 PEs. The greedy strategy shows a slight advantage, but round-robin partitioning does not perform
significantly worse. Greedy edge-based partitioning was adopted as the default strategy.
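The two strategies can be contrasted with a short Python sketch. The details of the greedy edge-based heuristic are not given in this section, so the version below is an assumption: vertices are taken in decreasing degree order and each is assigned to the partition that currently holds the fewest edges, in the style of greedy number partitioning (cf. Korf 2009). All function names are hypothetical.

```python
import heapq


def round_robin_partition(degrees, parts):
    """Assign vertex v to partition v mod parts, ignoring its degree."""
    return [v % parts for v in range(len(degrees))]


def greedy_edge_partition(degrees, parts):
    """Assumed heuristic: place high-degree vertices first, always into the
    partition with the smallest edge count so far."""
    heap = [(0, p) for p in range(parts)]    # (edge load, partition id)
    assignment = [0] * len(degrees)
    for v in sorted(range(len(degrees)), key=lambda v: -degrees[v]):
        load, p = heapq.heappop(heap)
        assignment[v] = p
        heapq.heappush(heap, (load + degrees[v], p))
    return assignment


def edge_imbalance(degrees, assignment, parts):
    """Ratio of the heaviest partition's edge count to the mean edge count."""
    loads = [0] * parts
    for v, p in enumerate(assignment):
        loads[p] += degrees[v]
    return max(loads) / (sum(loads) / parts)
```

On a deliberately skewed degree list such as `[100, 1, 1, 1, 90, 1, 1, 1]` split four ways, round-robin places both heavy vertices in the same partition, while the greedy sketch separates them. This input is constructed to make the difference visible; the measurements above show a much smaller gap in practice.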
[Figure 13: three panels (PageRank, BFS, WCC), each plotting system throughput (MTEPS) against the number of FPGAs (1 to 4).
Series: Metis, Greedy Edge.]
Fig. 13. Greedy edge-based partitioning vs. METIS.
Fig. 13 compares the performance when scaling across multiple FPGAs using location-agnostic partitioning from the
greedy edge-based heuristic, or a high-quality partition obtained from the METIS tool (Karypis and Kumar 1998) that
attempts to minimize the number of edges crossing FPGA boundaries. As METIS runtime is several thousand times
higher than the runtime of the algorithms under investigation, it should be viewed more as an upper bound on the
effects of partitioning quality rather than as a realistic option. The greedy edge-based heuristic is able to achieve results
within 5% of METIS.
6.4 Comparison to other frameworks
As noted by Oguntebi and Olukotun (2016), the commonly used throughput measure MTEPS (Millions of Traversed
Edges Per Second) is not always comparable across different works, because the notion of what constitutes ‘traversing’
an edge is not well-defined. The related works examined in this section were not described in enough detail to ascertain
the authors’ definition of traversal; as such we can merely take the reported numbers at face value and assume that their
definition matches ours.
6.4.1 Multi-FPGA frameworks. ForeGraph (Dai et al. 2017) is the only multi-FPGA framework against which we can
compare our work. Their work is evaluated by simulating a 4-FPGA platform similar to ours (memory bandwidth of 19.2 GB/s vs.
21.7 GiB/s, network bandwidth 1.4 GiB/s vs. 1.5 GiB/s), and with the same benchmark algorithms, on a social graph of
comparable edgefactor (35, vs. 32 for the dataset that our results were obtained on). Table 3 shows the peak performance
reported for ForeGraph and that obtained for our work. GraVF-M achieves performance 2.5–3.8× that of the closest
comparable work.
Algorithm    ForeGraph    GraVF-M
PageRank          1856       4623
BFS               1458       5493
WCC               1727       5791
Table 3. Peak performance obtained for different algorithms (MTEPS)
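The quoted ratio range follows directly from the entries of Table 3, as the following Python check shows. The variable names are arbitrary; the values are those of the table.

```python
# Peak throughput in MTEPS from Table 3: (ForeGraph, GraVF-M).
PEAK_MTEPS = {
    "PageRank": (1856, 4623),
    "BFS": (1458, 5493),
    "WCC": (1727, 5791),
}

# Ratio of GraVF-M throughput to ForeGraph throughput per algorithm.
speedup = {alg: gravf_m / foregraph
           for alg, (foregraph, gravf_m) in PEAK_MTEPS.items()}

for alg, ratio in speedup.items():
    print(f"{alg}: {ratio:.1f}x")
```

PageRank gives the lower end of the range and BFS the upper end.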
6.4.2 Single-FPGA frameworks. We can also compare the performance of single-FPGA frameworks to the per-FPGA
performance of GraVF-M by dividing the throughput by the number of FPGAs. The following works report throughput
numbers on one of our benchmark algorithms in a compatible format:
(Zhou et al. 2016) achieves 747-1068 MTEPS for the WCC algorithm, on a platform with a memory bandwidth of
19.2 GB/s and on datasets with edgefactor 2-14. Our work has a throughput of 1447 MTEPS per FPGA.
(Zhou and Prasanna 2017) reports executing BFS at 670 MTEPS, on a hybrid platform comprised of a CPU with
memory bandwidth 30 GB/s and an FPGA with memory bandwidth 12.8 GB/s, on synthetic graphs with edgefactors of
4 to 16, compared to 1373 MTEPS per FPGA for GraVF-M.
(Zhou et al. 2018) includes an evaluation using PR, on a platform with 60 GB/s memory bandwidth, achieving
1116-2487 MTEPS on social graphs with edgefactors 2-35. Our work has a per-FPGA throughput of 1155 MTEPS on this
algorithm.
(Oguntebi and Olukotun 2016) shows throughput of approximately 110 MTEPS for PR, but it is very likely that their
definition of edge traversal differs. Their SpMV algorithm achieves 750 MTEPS and PageRank can be implemented as
a succession of sparse matrix-vector multiplications, so the performance of the two should not differ this drastically
under our definition.
Other works’ performance is even more difficult to compare, as they use different algorithms for their evaluation or
report their numbers as speedup against some software baseline (or in the case of (Zhou et al. 2017), GPU).
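The per-FPGA figures used in this comparison appear to derive from the 4-FPGA peak values of Table 3; the short Python check below reproduces them under that assumption. Integer division is used because the quoted numbers match the truncated quotients.

```python
NUM_FPGAS = 4

# 4-FPGA peak throughput of GraVF-M in MTEPS, taken from Table 3.
PEAK_MTEPS = {"WCC": 5791, "BFS": 5493, "PR": 4623}

# Per-FPGA throughput: system throughput divided by the number of FPGAs.
per_fpga = {alg: total // NUM_FPGAS for alg, total in PEAK_MTEPS.items()}

for alg, mteps in per_fpga.items():
    print(f"{alg}: {mteps} MTEPS per FPGA")
```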
7 CONCLUSION
In this work we have presented the GraVF-M framework, which generates graph processing designs optimized for
multi-FPGA platforms. This framework greatly simplifies the implementation of FPGA-based accelerators for vertex-centric
graph processing algorithms, while offering attractive performance. Our proposed architecture allows the
familiar message-passing paradigm to be maintained, while exploiting structural features of the programming model to
transfer portions of the computation to the receiver in order to minimize inter-FPGA network traffic.
A 4-FPGA system generated by GraVF-M reaches 4.6-5.8 GTEPS, which compares favorably to other works in
the literature. It is also up to 94% of the theoretical peak performance projected by our performance analysis. The
performance model we developed can be used to evaluate the likely performance of platforms based on their memory
and network interface capabilities before proceeding to implementation.
ACKNOWLEDGMENTS
This work was supported in part by the Research Grants Council of Hong Kong (Project GRF 17245716), and the
Croucher Foundation (Croucher Innovation Award 2013). The authors would also like to thank Micron for the loan of
the Pico SC-6 Mini evaluation platform.
REFERENCES
O. G. Attia, T. Johnson, K. Townsend, P. Jones, and J. Zambreno. 2014. CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search. In Parallel Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. 228–235. https://doi.org/10.1109/IPDPSW.2014.30
O. G. Attia, A. Grieve, K. R. Townsend, P. Jones, and J. Zambreno. 2015. Accelerating all-pairs shortest path using a message-passing reconfigurable architecture. In 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 1–6. https://doi.org/10.1109/ReConFig.2015.7393284
Brahim Betkaoui, David B Thomas, Wayne Luk, and Natasa Przulj. 2011. A framework for FPGA acceleration of large graph problems: graphlet counting case study. In 2011 International Conference on Field-Programmable Technology (FPT). IEEE, 1–8.
B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk. 2012. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration. In 23rd International Conference on Application-Specific Systems, Architectures and Processors (ASAP). https://doi.org/10.1109/ASAP.2012.30
D. Chakrabarti, Y. Zhan, and C. Faloutsos. 2004. R-MAT: A Recursive Model for Graph Mining. 442–446. https://doi.org/10.1137/1.9781611972740.43
Convey Computer. 2011. Convey Computer Doubles Graph500 Performance, Develops New Graph Personality. (November 2011). Press release.
Convey Computer. 2012. New Convey MX™ Demonstrates Leading Power/Performance on Graph 500 Benchmark. (November 2012). Press release.
Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16). ACM, New York, NY, USA, 105–110. https://doi.org/10.1145/2847263.2847339
G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang. 2017. ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture. In International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 10. https://doi.org/10.1145/3020078.3021739
Michael Delorimier, Nachiket Kapre, Nikil Mehta, and André Dehon. 2011. Spatial Hardware Implementation for Sparse Graph Algorithms in GraphStep. ACM Trans. Auton. Adapt. Syst. 6, 3, Article 17 (Sept. 2011), 20 pages. https://doi.org/10.1145/2019583.2019584
M. deLorimier, N. Kapre, N. Mehta, D. Rizzo, I. Eslick, R. Rubin, T.E. Uribe, T.F. Knight Jr., and A. DeHon. 2006. GraphStep: A System Architecture for Sparse-Graph Algorithms. In 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM). https://doi.org/10.1109/FCCM.2006.45
Nina Engelhardt, Dominic C.-H. Hung, and Hayden K.-H. So. 2018. Performance-Driven System Generation for Distributed Vertex-Centric Graph Processing on Multi-FPGA Systems. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 215–2153. https://doi.org/10.1109/FPL.2018.00043
N. Engelhardt and H. K. H. So. 2016. GraVF: A vertex-centric distributed graph processing framework on FPGAs. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). https://doi.org/10.1109/FPL.2016.7577360
Nina Engelhardt and Hayden Kwok-Hay So. 2017. Towards Flexible Automatic Generation of Graph Processing Gateware. In International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART).
N. Kapre. 2015. Custom FPGA-based soft-processors for sparse graph acceleration. In 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP). https://doi.org/10.1109/ASAP.2015.7245698
G. Karypis and V. Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392. https://doi.org/10.1137/S1064827595287997
Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’18). ACM, New York, NY, USA, 239–248. https://doi.org/10.1145/3174243.3174260
Richard E. Korf. 2009. Multi-way Number Partitioning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI’09). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 538–543. http://dl.acm.org/citation.cfm?id=1661445.1661531
Jinho Lee, Heesu Kim, Sungjoo Yoo, Kiyoung Choi, H. Peter Hofstee, Gi-Joon Nam, Mark R. Nutter, and Damir Jamsek. 2017. ExtraV: Boosting Graph Processing Near Storage with a Coherent Accelerator. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1706–1717. https://doi.org/10.14778/3137765.3137776
Guoqing Lei, Rongchun Li, Song Guo, and Fei Xia. 2015. TorusBFS: A Novel Message-passing Parallel Breadth-First Search Architecture on FPGAs. IRACST - Engineering Science and Technology: An International Journal (ESTIJ) 5, 5 (Oct 2015), 313–318.
Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. (June 2014).
M-Labs. 2012. Migen. (2012). http://m-labs.hk/migen
Xiaoyu Ma, Dan Zhang, and Derek Chiou. 2017. FPGA-Accelerated Transactional Execution of Graph Workloads. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’17). ACM, New York, NY, USA, 227–236. https://doi.org/10.1145/3020078.3021743
G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: a system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data. ACM.
Richard C Murphy, Kyle B Wheeler, Brian W Barrett, and James A Ang. 2010. Introducing the Graph500. Cray User’s Group (CUG) (2010).
E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin. 2014. GraphGen: An FPGA Framework for Vertex-Centric Graph Computation. In 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 25–28.
Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for Graph Analytics Acceleration. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16). ACM, New York, NY, USA, 111–117. https://doi.org/10.1145/2847263.2847337
Y. Umuroglu, D. Morrison, and M. Jahre. 2015. Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform. In International Conference on Field Programmable Logic and Applications (FPL). https://doi.org/10.1109/FPL.2015.7293939
Q. Wang, W. Jiang, Y. Xia, and V. Prasanna. 2010. A message-passing multi-softcore architecture on FPGA for Breadth-first Search. In International Conference on Field-Programmable Technology (FPT). https://doi.org/10.1109/FPT.2010.5681757
J. Zhang, S. Khoram, and J. Li. 2017. Boosting the Performance of FPGA-based Graph Processor Using Hybrid Memory Cube: A Case for Breadth First Search. In International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 10. https://doi.org/10.1145/3020078.3021737
Jialiang Zhang and Jing Li. 2018. Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’18). ACM, New York, NY, USA, 229–238. https://doi.org/10.1145/3174243.3174245
J. Zhou, S. Liu, Q. Guo, X. Zhou, T. Zhi, D. Liu, C. Wang, X. Zhou, Y. Chen, and T. Chen. 2017. TuNao: A High-Performance and Energy-Efficient Reconfigurable Accelerator for Graph Processing. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). 731–734. https://doi.org/10.1109/CCGRID.2017.114
S. Zhou, C. Chelmis, and V. K. Prasanna. 2016. High-Throughput and Energy-Efficient Graph Processing on FPGA. In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 103–110. https://doi.org/10.1109/FCCM.2016.35
Shijie Zhou, Rajgopal Kannan, Hanqing Zeng, and Viktor K. Prasanna. 2018. An FPGA Framework for Edge-centric Graph Processing. In Proceedings of the 15th ACM International Conference on Computing Frontiers (CF ’18). ACM, New York, NY, USA, 69–77. https://doi.org/10.1145/3203217.3203233
S. Zhou and V. K. Prasanna. 2017. Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 137–144. https://doi.org/10.1109/SBAC-PAD.2017.25
Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 375–386. https://www.usenix.org/conference/atc15/technical-session/presentation/zhu
