GRIP: A Graph Neural Network Accelerator Architecture by Kiningham, Kevin et al.
GRIP: A Graph Neural Network Accelerator
Architecture
Kevin Kiningham∗, Christopher R, Philip Levis
Stanford University
∗kkiningh@stanford.edu
Abstract—We present GRIP, a graph neural network accelera-
tor architecture designed for low-latency inference. Accelerating
GNNs is challenging because they combine two distinct types of
computation: arithmetic-intensive vertex-centric operations and
memory-intensive edge-centric operations. GRIP splits GNN infer-
ence into a fixed set of edge- and vertex-centric execution phases
that can be implemented in hardware. We then specialize each
unit for the unique computational structure found in each phase.
For vertex-centric phases, GRIP uses a high performance matrix
multiply engine coupled with a dedicated memory subsystem for
weights to improve reuse. For edge-centric phases, GRIP use
multiple parallel prefetch and reduction engines to alleviate the
irregularity in memory accesses. Finally, GRIP supports several
GNN optimizations, including a novel optimization called vertex-
tiling which increases the reuse of weight data.
We evaluate GRIP by performing synthesis and place and
route for a 28 nm implementation capable of executing inference
for several widely-used GNN models (GCN, GraphSAGE, G-
GCN, and GIN). Across several benchmark graphs, it reduces
99th percentile latency by a geometric mean of 17× and 23×
compared to a CPU and GPU baseline, respectively, while
drawing only 5W.
Index Terms—Deep Learning; Hardware Acceleration;
Algorithm-Hardware co-Design; ASIC;
I. INTRODUCTION
Traditional deep neural networks (DNNs) rely on regularly
structured inputs (e.g. vectors, images, or sequences) making
them difficult to use in domains where data is naturally
irregular (e.g. user connections on social media). Graph neural
networks (GNNs) tackle this limitation by extending DNNs to
allow arbitrarily structured graph-valued inputs, where feature
vectors are associated with the edges and vertices of a graph1.
GNNs have found significant success in a range of practical
tasks, including surfacing related content on social media [46],
recommending meals on delivery platforms [24], and improving
circuit testability for EDA [32].
GNNs combine two distinct types of operations [17], [31]:
vertex-centric, which are associated with graph vertices, and
edge-centric, which are associated with edges. Vertex-centric
operations are computationally regular and primarily consist
of multiplying vertex feature vectors by large weight matrices.
These weights are shared across all vertices, leading to
significant opportunities for data reuse. Edge-centric operations
are similar to those found in graph analytics (e.g. neighborhood
reduction [42]). Their computational structure depends on the
1 Following the convention in prior work [21], for clarity we call a GNN’s
input a graph and the GNN itself a network.
often sparse and irregular structure of the input graph. This
results in many random memory accesses and limited data
reuse, but also requires relatively little computation.
The combination of these two types of computation makes
GNN inference inefficient on existing architectures. As a result,
GNNs have much higher inference latency than other neural
networks, limiting them to applications where inference can
be pre-computed offline [46]. Most DNN accelerators (e.g.
the TPU [25]) are optimized for dense, regular computation,
making edge operations hard to implement efficiently [3]. Graph
analytics accelerators (e.g Graphicionado [20]) are designed
for workloads that require little computation per-vertex and
have difficulty exploiting data reuse in vertex-centric operations.
Prior work has demonstrated inference on CPUs and GPUs
is limited by architectural issues, such as cache and memory
bandwidth bottlenecks [19], [45].
This paper proposes GRIP (GRaph Inference Processor), an
accelerator architecture designed for low-latency GNN infer-
ence. GRIP’s programming model is inspired by GReTA [27],
a decomposition of GNN inference into a fixed set of edge-
and vertex-centric phases. GRIP implements each phase with
separate specialized on-chip memory and execution units. For
example, GRIP alleviates irregularity in the edge-accumulate
phase by using multiple parallel prefetch engines to load data.
This allows GRIP to support a broader class of GNNs than
prior work, including emerging models that perform complex
computation per-edge. Finally, GRIP includes hardware support
for several optimizations: caching partitions of feature data,
inter-phase pipelining, and preloading weights between layers.
We also introduce a novel GNN optimization called vertex-
tiling that substantially improves latency by increasing the
reuse of weight values during inference.
A. Contributions
This paper makes the following contributions:
1) GRIP, an accelerator architecture for low-latency GNN
inference. GRIP is efficient across a wide range of models
and has numerous hardware optimizations to improve
inference latency.
2) A novel optimization for GNN inference called vertex-
tiling, which improves performance by increasing reuse
of weights.
3) A detailed description of a 28 nm implementation of
GRIP capable of executing four representative GNNs
(GCN, GraphSAGE, G-GCN, and GIN). Evaluated across
ar
X
iv
:2
00
7.
13
82
8v
1 
 [c
s.A
R]
  2
7 J
ul 
20
20
BA
D
E F
C
(a) Input graph.
B
A C ED
A B D E F
Layer 2
Layer 1
(b) Nodeflow.
Message Passing Layer
2. Aggregate1. Send
B
D B
A
C
D
E
MPL
A
E
F
3. Update
…
Shared
Weights
(c) Inference dataflow.
Fig. 1: An example of performing GCN inference on vertex B
with two layers. The nodeflow (b) describes the propagation
of features during inference (c).
several benchmark graphs, our implementation reduces
99th percentile latency by a geometric mean of 17× and
23× compared to an Intel Xeon CPU and Nvidia P100
GPU baseline, respectively.
II. BACKGROUND AND MOTIVATION
A. Graph Neural Networks
GNNs [4], [43] are a class of DNN that operate on graph-
valued data. Unlike traditional DNNs, GNNs directly take
advantage of graph structure during learning and inference.
For example, consider the task of classifying web-pages by
topic. A pure content approach (e.g. a classic recurrent neural
network) considers only features derived from a page’s content.
However, a significant amount of information is stored in the
structure of links between pages. By modeling these links as
a graph, a GNN can natively leverage both page content and
link structure. GNN-based methods have achieved state of
the art performance on a diverse set of graph-related tasks,
including link prediction [50], vertex classification [46], and
clustering [47].
Message-Passing Layer. Modern GNNs are typically com-
posed of multiple message-passing layers [17], shown in Alg. 1.
The layer takes as input a graph G consisting of a set of vertices
V and edges E. Each vertex and edge is assigned a feature
vector hv and h(u,v) respectively. Computation is split into
three operations:
• Send computes a message vector mu,v for each edge.
• Aggregate reduces incoming messages for each vertex to
a vector av . The neighborhood function N(v) determines
Algorithm 1 Message Passing Layer Forward Pass
Input: Graph G = (V, E); Vertex and edge features hv,
h(u,v)
Output: Updated vertex features zv
1: for (u, v) in E do
2: mu,v ← Send(hv, hu, h(u,v))
3: for v in V do
4: av ← Aggregate({mu,v | u ∈ N(v)})
5: zv ← Update(hv, av)
which messages are considered, typically using a fixed
size random sample.
• Update combines each vertex’s current value with the
output of aggregation to produce an updated vector zv .
By iteratively applying K of these layers, the final state for
each vertex captures information about the structure of its
K-hop neighborhood.
Pooling and Readout. Two other layer types are also used
in some GNNs. Pooling defines a method of coarsening a graph
by combining the features of clusters of vertices. Readout is a
special case of pooling which uses a single graph-wide cluster
to produce a representation for an entire graph (e.g. for graph
classification). In this paper, we treat both layers as slightly
modified versions of message-passing, where edges connect
vertices to clusters rather than other vertices.
Nodeflow. A nodeflow [22] is a bipartite data structure
that describes how features are propagated during message-
passing. It is typically generated during a preprocessing step
before inference, but can also be created on-demand (e.g. for
dynamic graphs). The nodeflow is most useful when performing
inference on a subset of the graph since it makes it easy to
determine which edges and vertices are required to update
a specific vertex. It can also be used to separate sampling
from inference by precomputing the neighborhood function
and encoding the result directly in the nodeflow.
In this paper, we denote the nodeflow for a particular layer
as the three-tuple (U, V,E), where U is the set of vertices read
during inference, V is the set of vertices updated, and E is a
set of edges connecting vertices in U to V . Fig. 1 shows an
example of using the nodeflow to compute inference in a two
layer GNN.
GCN. We use the Graph Convolutional Network [28] (GCN)
as a concrete running example of a GNN model. GCN uses
multiple message-passing layers with the following send,
aggregate, and update operations
mu,v ← hu
av ← mean({mu,v | u ∈ N(v)})
zv ← ReLU(Wav)
where W is a trainable weight matrix. We can rewrite this to
use sparse-dense matrix multiplication (SpMM)
Z ← ReLU(AˆHW ) (1)
2
where Aˆ is a sparse matrix derived from the nodeflow and H
and Z are dense matrices formed from the set of input and
output features respectively. This allows GCN inference to be
implemented using operations from highly optimized sparse
matrix libraries, such as Intel MKL [23] or cuSPARSE [35].
B. Performance Challenges of GNNs
To demonstrate the performance challenges of GNNs in
practice, we implement 2-layer GCN using the SpMM form
in Eq. 1. Our implementation uses Tensorflow compiled with
Intel MLK run on a single socket of an Intel Xeon E5-2690v4.
In Fig. 2, we plot measured performance verses arithmetic
intensity for each vertex in the Pokec dataset. Arithmetic
intensity depends on the number of unique neighbors that
must be read during inference, which is determined by the
local graph structure of each vertex.
Fig. 2: CPU performance of GCN inference for vertices in
the Pokec [29] dataset. Bottlenecks in cache bandwidth result
in a significant gap between measured performance and the
roofline upper bound.
In this dataset, inference performance is theoretically bottle-
necked by off-chip memory bandwidth for all vertices. However,
there is a significant gap between the theoretical upper bound
and the actual measured performance at higher levels of
arithmetic intensity. Profiling shows the primary bottleneck
is last level cache bandwidth, a result consistent with prior
analysis of GPU performance [19]. In our experiment, the
highest arithmetic intensities occur when a vertex appears
in multiple neighborhoods and its feature vector can be
reused. However, if multiple cores are reading or writing a
vertex in parallel, this also results in higher utilization of
cache bandwidth. Additionally, features must compete with
large weight values that also occupy the cache and consume
bandwidth during the vertex-centric Update operation.
Opportunities for Acceleration. While GNN inference per-
formance may be limited on existing hardware, the difficulties
described in this section can be overcome with a custom
architecture. In particular, we propose using separate specialized
memory and execution units for each edge- and vertex-centric
operation. To specialize for vertex-centric operations, we use a
dedicated high performance matrix-multiplication unit. Weights
are stored on-chip in dedicated memory with a level of caching
to improve reuse. For edge-centric operations, we prefetch data
for multiple edges in parallel and specialize the on-chip feature
memory to enable fast gather and reduction operations. Finally,
since the nodeflow is known statically, we can also improve off-
chip access efficiency by scheduling bulk transfers of feature
data rather than loading on demand during execution. Taken
together, this gives a significant opportunity for improving
GNN inference performance.
III. RELATED WORK
DNN Accelerators. A significant number of custom neural
network accelerates have been developed, mostly focused on
dense operations [8], [9], [10], [13], [14], [15], [16], [25],
[30], [41], [49]. However, edge-centric operations are difficult
to implement efficiently on these architectures [3], which are
much more computationally irregular than traditional DNNs.
GRIP natively supports edge-centric operations by using a
graph-processing based programming model (Sec. IV) and by
a combination of specialized memory for edge accesses and
software techniques. In Sec. VIII-F we estimate GRIP to be
2.4× faster than a comparable TPU-like accelerator modified
specifically to improve GNN inference performance.
GCN Accelerators. HyGCN [45] and GraphACT [48]
are two accelerators designed to for graph convolutional
networks, a subclass of graph neural networks. Like GRIP,
these accelerators use separate edge- and vertex-centric units for
GNN computation. GRIP builds on these designs by handling
a much more general set of GNNs that includes models that
use computation associated with edges. This is important for
many emerging state-of-the-art GNNs, such as Graph Attention
Networks [40]. Additionally, GRIP’s support for vertex-tiling
reduces the amount of weight bandwidth required during vertex-
oriented operations. In Sec. VIII-F we estimate this improves
performance by 4.5× compared to HyGCN.
Graph Analytics Accelerators. Specialized accelerators
have also been proposed for graph analytics workloads [34],
[36], [37]. However, these workloads require relatively little
computation per-vertex and typically use scalars rather than
large feature vectors. Thus, the computation and memory access
patterns are very different. In Sec. VIII-F, we estimate GRIP
to be 8.1× faster than the approach of Graphicionado [20].
GNN Optimizations. Many optimizations have been pro-
posed to improve GNN performance. Common techniques
include scheduling computation to reducing the impact of
sparsity [3], [11], [31], improved sampling [7], or eliminating
redundant computation [48]. These techniques are compatible
with GRIP and can be used for additional performance.
IV. PROGRAMMING MODEL
GRIP’s programming model is based on GReTA [27], a
graph-processing abstraction specialized for implementing
GNNs. GReTA decomposes GNN layers into four stateless user-
defined functions (UDFs): gather, reduce, transform,
and activate. GRIP invokes each UDF in one of three
execution phases: edge-accumulate, vertex-accumulate, and
vertex-update. GRIP also allows programs to be composed by
using the result of one program as the features or accumulator
in another. This flexibility allows a wide range of GNN models
to be implemented.
Data Model. UDFs are restricted in the types of data they
can access to in order to simplify hardware implementation.
3
Algorithm 2 GRIP Program Execution Semantics
Input: Layer nodeflow (U, V,E); Vertex data hu and hv;
Edge data h(u,v); Accumulators ev and av; Weights and
biases W
Output: Updated vertex data zv
1: /* Edge-Accumulate Phase */
2: for (u, v) in E do
3: ev = reduce(ev , gather(hu, hv , h(u,v)))
4: /* Vertex-Accumulate Phase */
5: for v in V do
6: av = transform(av , ev , W )
7: /* Vertex-Update Phase */
8: for v in V do
9: zv = activate(av)
GRIP programs use four types of data: (1) A nodeflow NF =
(U, V,E) encodes computational structure by defining the
vertices and edges to read and update. (2) Feature vectors hu,
hv , and h(u,v) associated with nodeflow input vertices, output
vertices, and edges respectively. (3) A set of constant layer
weights W . (4) Edge-accumulator ev and vertex-accumulator
av associated with each output vertex.
Execution Semantics. UDFs are executed in three phases:
1) Edge-accumulate iterates over nodeflow edges and in-
vokes gather and reduce. Gather reads features
associated with an edge to produce a message value.
Reduce accumulates messages sharing an output vertex
into ev . This results in a single value per output vertex.
2) Vertex-accumulate iterates over each output vertex and
combines ev with the previous accumulator state av
using transform. Transform is the only UDF
with access to layer weights and is usually the most
computationally expensive operation in a layer (e.g.
matrix multiplication).
3) Vertex-update again iterates over each output vertex
and applies activate to av. The activate UDF
typically implements the non-linear operations required
in a layer (e.g. the activation function). This produces a
final updated value for each vertex z′v .
Alg. 2 shows the full execution semantics of a GRIP program.
A. Layer Implementation
The decomposition used by GRIP is expressive enough to
allow implementing a wide variety of GNNs. Implementing a
particular layer is typically straightforward since each phase
naturally maps to the operations of the massage-passing layer
introduced in Sec. II. However, some complex models may
require a layer to be split into multiple programs. This is
especially true for models that require significant computation
per-edge. For example, consider the following modified GCN
Send operation
mu,v ←W0hu (2)
This cannot be mapped directly to gather and reduce
since they do not have access to the layer weights. Instead, we
Edge-Accumulate Vertex-Accumulate Vertex-Update
(a) (b)
Fig. 3: Modifying the GCN Send operation (Eq. 2) requires
splitting the layer into two sequential programs (a) and (b).
The dashed box in (a) indicates a phase with no computation.
(a) GraphSAGE [21]
(b) GIN [44]
(c) G-GCN [2], [5], [33]
Fig. 4: GRIP implementation of several GNN models. Plus-
boxes indicate the output of one program is used as the edge
or vertex-accumulator of another. Phases with no associated
computation are omitted.
implement this layer by splitting it into two GRIP programs as
shown in Fig. 3. Note that splitting a layer may result in each
program iterating over a different nodeflow. For example, the
program in Fig. 3a iterates over an identity nodeflow where
all vertices are only self-connected. In Fig. 4, we demonstrate
the flexibility of this approach by showing the implementation
of a variety of different GNN models.
V. THE GRIP ARCHITECTURE
GRIP is an accelerator architecture for low-latency GNN
inference. Rather than designing around a specific GNN, GRIP
allows users to customize the architecture by implementing
four processing elements (PEs) corresponding to each UDF
of GReTA. This allows GRIP to be used to accelerate a wide
variety of models. In this section, we describe an overview of
GRIP and the microarchitecture of each execution unit. A high
level overview of GRIP is shown in Fig. 5.
A. Overview
Control. GRIP is controlled by a host system that sends
commands to execute different operations or transfer data. The
control unit dequeues each command in-order and issues them
asynchronously to individual execution units or the memory
controller. Additionally, almost all buffers use double-buffering
to allow overlapping the execution of different operations with
moving data between buffers or loading from off-chip. A barrier
command is used to enforce dependencies by preventing new
4
Ho
st Control
SRAM
Accumulator
Control
UnitCmd. FIFO
Status
Off-Chip
DR
AM
Nodeflow
Buffer
…
Edges
FeaturesBa
nk
Edges
FeaturesBa
nk
M
em
or
y C
on
tro
l
Global Weight Buffer Tile Buffer
Edge
Accum.
Bank
Bank
…
Vertex
Accum.
Bank
Bank
…
Vertex
Unit
Transform 
PE
Transform 
PE
…
Edge Unit
… …
Pre-
fetch
Pre-
fetch
Cr
os
sb
ar
Update
Unit
Activate
PE
Activate
PE
…
Weight Sequencer
Gather
PE
Reduce
PE
Gather
PE
Reduce
PE
Custom PE
…
Fig. 5: High-level overview of GRIP.
commands from being issued until all previous commands have
completed. Each command also updates a global status register
on completion, which can be queried by the host to monitor
execution.
Execution Units. GRIP has three core execution units: the
edge unit, the vertex unit, and the update unit. The edge unit
performs the edge-accumulate phase by iterating over the edges
of the nodeflow, which is stored in the nodeflow buffer. The
edge unit then reads the associated features, executes gather,
and finally accumulates the result into the edge accumulator
using reduce.
The vertex unit performs the vertex-accumulate phase by
iterating over the output vertices corresponding to the accu-
mulated edge values. It then executes transform, accumulating
the result into the vertex accumulator. The vertex unit also
reads weight values from the tile buffer, which caches tiles of
weight values from the global weight buffer. To synchronize the
tile buffer and the vertex unit, the weight sequencer controls
iterating over the tiles as described in Section VI-B. Since
weight values are shared across all nodeflow output vertices,
the global weight buffer is only required to be loaded once at
the beginning of a GRIP program.
Finally, the update unit performs the vertex-update phase by
reading the accumulated values for each vertex and passing
the values to the activate PE. The result is written to the
nodeflow buffer as an updated feature, or to the edge or vertex
accumulator. This allows efficiently passing values between
different GRIP programs when they are executed in sequence.
PE Implementation. GRIP allows users to customize four
PEs corresponding to the UDFs introduced in Sec. IV. These can
be implemented in multiple ways depending on the user’s needs.
For example, they could be implemented using a reconfigurable
fabric (e.g. an FPGA) for maximum flexibility. Alternatively,
they could be implemented using a model specific circuit to
optimize for area or performance.
Our implementation uses a programmable ALU based
approach. Since most common GNNs only require a small
number of operations in practice, this allows us to support
a range of models on the same hardware while remaining
reasonably efficient in practice. Specifically, we allow gather
to be identity (e.g. hu or hv), element-wise sum, product, or
scale by constant; reduce to be element-wise sum, max, or
mean; transform to be matrix multiplication followed by
element-wise sum; and activate to be either ReLU or a
LUT operation which we describe in Sec. V-D. While these
cover most GNN models we investigated, expanding the set
of supported operations may be required for other GNNs. We
leave exploring other possible implementations for future work.
Memory Controller. The memory controller is responsible
for moving data on- and off-chip. Instead of each execution
unit issuing requests to the memory controller directly, the
host is required to statically schedule memory transfers before
execution. This is possible since the set of features required
for inference can be easily determined from the nodeflow. This
also prevents individual units from stalling on external memory
accesses, but requires scheduling commands such that loading
data fully overlaps with execution (Sec. VI-A).
B. Edge Unit
Reduce LanePrefetch Lane
Iterate
Outgoing
Edges
Read 
Source
Feature
Dequeue
Source
Vertex ID
Cr
os
sb
ar
Gather
PE
Read
Dest.
Feature
Reduce
PE
P0 P2P1 R0
Read
Edge
Accum.
Write
Edge
Accum.
R1 R2 R3 R4
Fig. 6: The edge unit pipeline is split into source-oriented
(P0-P2) and destination-oriented (R0-R4) sections called lanes
that can be independently replicated. Stage R0 is only used
for models that require reading source features.
The edge unit pipeline is split into two distinct halves (Fig. 6).
Stages P0-P2 implement prefetch, which iterates over the edges
of the nodeflow and reads the features corresponding to the
source vertex. The result is passed to reduce (stages R0-R4),
which reads the corresponding destination feature and then
applies gather and reduce, accumulating the result into
the edge accumulator. GRIP allows optionally disabling stage
R0 since most models do not require reading source features.
Parallelization. The edge-accumulator value for each output
vertex can be computed independently. This means there is
a significant amount of parallelism that can be exploited to
improve the performance of the edge unit. A simple method to
5
parallelize execution is to duplicate the elements of the edge
unit into N identical copies. Each copy can then be assigned
a subset of output vertices to process in parallel (e.g. by a
random hash of the vertex ID). However, since the nodeflow
buffer is read every cycle by each lane, this approach requires
adding 2N ports to the nodeflow buffer.
Instead, GRIP duplicates prefetch and reduce into N and
M copies called lanes. Each lane is statically assigned a
partition of input vertices (for prefetch) or output vertices
(for reduce). Similarly, edges are assigned to a prefetch lane
based on the edge’s source vertex. During execution, each
prefetch lane iterates over its assigned edges and reads each
edge’s corresponding feature. It then sends the feature data
through an N ×M crossbar to the reduce lane assigned to the
destination vertex. This design restricts each lane to accessing
only its assigned subset of features and edges, allowing GRIP
to partition the nodeflow buffer into N +M separate SRAMs.
As a result, GRIP scales to a much larger number of lanes than
the simpler design. Additionally, our implementation of GRIP
extends this scheme to include off-chip memory by storing
feature data pre-partitioned and setting the number of prefetch
lanes equal to the number of DRAM channels.
C. Vertex Unit
The vertex unit implements the vertex-accumulate phase by
iterating over the output vertices and applying transform.
Our implementation restricts transform to a matrix multipli-
cation, which we implement using a 16× 32 weight stationary
PE array [9]. Each PE contains a 16-bit multiplier, as well as
a local double buffered weight register. The PE array is broken
into two 16× 16 blocks. Blocks can be configured to use one
of two modes: cooperative, where both blocks operate on the
same vertex, or parallel, where blocks operate on different
vertices in parallel. Parallel mode broadcasts weight values to
both blocks, allowing for slightly lower energy consumption
at the expense of higher latency when there is only a single
output vertex.
Unlike many other neural network accelerators, GRIP does
not use a systolic array structure. Instead, GRIP broadcasts
inputs across the rows of the array and accumulates results
down columns using a reduction tree. The entire operation is
pipelined to allow multiple matrix operations to occur without
stalling, even as weights are transferred in and out of the array.
This results in a significant savings in latency for a single
matrix-vector operations; instead of requiring 16 + 32 = 48
cycles, our implementation requires just six (three to distribute
values, one for multiplication, and two for reduction). This also
eliminates the buffers required for input skewing in a systolic
design.
D. Update Unit
The update unit iterates over each vector in the vertex
accumulator and applies activate. Our activate PE allows
two possible operations: element-wise ReLU and a two-level
configurable lookup-table (LUT) that can be used to approxi-
mate many activation functions. Each LUT level is implemented
A
B
C
D
E
F
A
B
C
D
E
(a)
A
B
C
D
E
F
A B C D
U1
U2
U3
E
V1 V2
1,1
3,2
(b)
V1
U1
U3
U2
Time
V2
U2
LD
LD EA
LD
LD
EA VA VU
EA
LD
LD EA VA VU
(c)
Fig. 7: An example of a nodeflow (a) and corresponding
partitions (b) which are processed column-wise. GRIP also
pipelines transferring feature data with execution (c).
as a separate table with 33 and 9 entries, respectively. Both
cover overlapping ranges of input: the first level from −2a to
2a, and the second level from −2b to 2b, where a and b are
user configurable values. The LUT entries lineally partition
the range, e.g. entry 0 of level 1 corresponds to −2a, entry 1
corresponds to −2a + 2a+1/32, etc. To perform an activation
computation, the input is first converted to a 16-bit fixed point
representation with 4-bits of integer precision. Each level is
then checked in series to see if the input falls in its range.
If so, the closest two LUT values are linearly interpolated
to produce an output. If the values overflow the range for
both levels, the input is either clamped to the closest value
in the second level, or a user configured linear function is
used. Additionally, the overflow behavior can be configured for
both positive and negative inputs independently, allowing the
implementation of non-symmetric activation functions. This
simple approximation covers a large number of activation
functions, including sigmoid, which is required for models
such as G-GCN.
VI. OPTIMIZATIONS
GRIP implements two major GNN optimizations: execution
partitioning and vertex-tiling. Execution partitioning describes
a method to split a GRIP program to operate on partitions of
a nodeflow, reducing the amount of on-chip memory required.
GRIP supports pipelining operations on different partitions,
improving performance. Vertex-tiling improves the locality of
weights the vertex-accumulate phase, reducing the memory
bandwidth required by the vertex unit. Collectively, these
optimizations reduce inference latency for GRIP by a significant
factor.
A. Execution Partitioning
A common GNN optimization is to split the graph into
partitions that can be computed on separately [27], [31], [45].
This reduces the peak amount of on-chip memory required to
compute inference since only a portion of the graph must be
loaded at once. GRIP supports a similar optimization we refer
to as execution partitioning, shown in Fig. 7. First, the user
partitions the nodeflow offline by splitting the input and output
vertices into fixed chunks of size N and M . Likewise, the
edges are partitioned into blocks of size N ×M , where block
NFi,j stores the edges connecting input vertices in chunk Ui to
output vertices in chunk Vj . During inference, GRIP executes
6
edge-accumulate for each partition in a column, skipping blocks
that are empty. Then, GRIP executes the vertex-accumulate and
vertex-update phases once, updating values in the corresponding
partition of output vertices. This ensures every incoming edge
for each output vertex is processed before vertex-accumulate.
Another advantage of execution partitioning is that operations
can be pipelined between partitions. GRIP implements two
kinds of pipelining related to partitioning. First, GRIP pipelines
loading data from off-chip with the edge-accumulate phase.
This allows overlapping execution with bulk loading feature
data for an entire partition. If enough space is available in
the nodeflow buffer, GRIP also optionally caches partition
feature data loaded during the processing of the first column
to avoid reload data while processing later columns. Second,
transferring weights from the global buffer can be pipelined
with processing an entire column. GRIP performs inter-layer
pipelining by loading the weights of the next layer while
processing the last column, and preloads the tile buffer before
processing the first column.
B. Vertex-Tiling
The bandwidth required to load layer weights can be a
significant bottleneck. For example, consider a GCN layer with
a feature size of 256. Since our implementation of transform
cannot hold the entire 1MB weight matrix locally, new weight
values must be loaded every cycle. At an operating frequency
of 1GHz, this requires a maximum of 2TB/s of tile buffer
bandwidth, which we found difficult to implement physically.
While this could be resolved by increasing the number of
weights stored within the multiplier array, this increases energy
usage and lacks flexibility since a model with a larger feature
size would still run into the same limitation.
A
B
C
D
E
F
Features Tile BufferEdge Accum. Vertex Accum.
F
M =X
O
of
m
Fig. 8: Vertex-tiling allows materializing a small tile of edge
accumulator values (m× f ) instead of the full M × F matrix.
This reduces the memory bandwidth required since a tile of
weight values can be reused across m vertices.
GRIP’s approach is to instead use an optimization we call
vertex-tiling. The key insight of vertex-tiling is that in almost
all cases transform is affine, which allows us to perform
an optimization similar to tiling matrix multiplication. Fig. 8
shows a graphical representation of this strategy. Here, edge-
accumulate produces f elements for m output vertices. This
requires storing f×m elements in the edge accumulator instead
of the full F ×M matrix. Then, we run vertex-accumulate
(in this case matrix multiplication), which loads each f × o
tile from the tile buffer. We then repeat this process, first for
all vertex tiles and then for all weight slices, maximizing the
locality of the weights. This reduces the bandwidth between
the tile buffer and the matrix unit by a factor of 1/m. Thus, by
tuning f and m we can trade-off the required bandwidth with
the amount of storage required for tiles and edge-accumulate
values.
VII. EXPERIMENTAL METHODOLOGY
Datasets. Table I describes the properties of the datasets
chosen for evaluation. Datasets were selected from previous
evaluations of GNNs [21], the SNAP project [29], and the UF
sparse matrix collection [12]. Included datasets were designed
to be similar to the workloads used by GNNs, as well as provide
a range of connectivity. We prepossessed each dataset using the
same procedure outlined by the authors of GraphSAGE [21].
The column “2-Hop” denotes the median number of unique
vertices within the 2-hop neighborhood of a vertex picked
uniformly at random from the graph, taking into account the
sampling procedure.
TABLE I: Datasets used for evaluation.
Dataset Nodes Edges 2-Hop
Youtube (YT) 1,134,890 2,987,624 25
Livejournal (LJ) 3,997,962 34,681,189 65
Pokec (PO) 1,632,803 30,622,564 167
Reddit (RD) 232,383 47,396,905 239
Models. We implemented four GNNs which cover a broad
range of different model types: GCN [28], the max variant of
GraphSage [21], GIN [44], and G-GCN [2], [5], [33]. For our
neighborhood function, we use the same sampling proceedure
as described by the authors of GraphSage. Specifically, we
deterministically map a given vertex to a fixed-sized, uniform
sample of its neighbors. For all models, we use two layers with
sample sizes 25 and 10 for the first and second layer, respec-
tively. Samples between layers are independent. Additionally,
we use a feature size of 602 (the feature size of the Reddit
dataset), a hidden dimension of 512, and an output dimension
of 256 for all layers.
Baseline. Our CPU baseline was a dual socket server
containing two, 14-core 2.60GHz Intel Xeon E5-2690 v4
CPUs, each with four channels of DDR4-2400 memory. We
restricted our experiments to a single socket to adhere to
Tensorflow performance guidelines [18] and to avoid latency
variation resulting from NUMA. In this configuration, we
measured a sustained 1.084TFlop/s in a matrix multiply
benchmark (93% of 1.164TFlop/s theoretical peak) and
64.5GiB/s of off-chip memory bandwidth (84% of 76.8GiB/s
theoretical peak).
We implemented both the baseline and our optimized
inference algorithm in Tensorflow v2.0 [1] with eager mode
disabled and compiled with the Intel Math Kernel Library [23].
To discount the overhead of the Tensorflow library for each
model, we measured the time to evaluate an equivalent model
with all tensor dimensions set to zero and subtract it from the
latency measurement. We also perform a warm-up inference
before all measurements to allow Tensorflow to compile and
optimize the network.
7
TABLE II: Architectural characteristics of baseline and GRIP.
CPU GRIP
Compute 1.164TOP/s@ 2.6GHz
1.088TOP/s
@ 1.0GHz
On-chip
memory
L1D: 14× 32 KiB
L2: 14× 256 KiB
LLC: 35MiB
Nodeflow: 4× 20 KiB
Tile: 2× 64 KiB
Weight: 2MiB
Off-chip
memory
4× DDR4-2400
76.8GiB/s
4× DDR4-2400
76.8GiB/s
Total Area 306.18mm2 11.27mm2
Power 135W 4.9W
ASIC Synthesis: We implemented GRIP in SystemVerilog,
choosing the architectural parameters to have similar compute
and memory bandwidth as our CPU baseline (Table II).
The implementation uses 16-bit fixed point, which maintains
suitable inference accuracy in the models we evaluate. We then
performed synthesis and place and route in a 28 nm CMOS
process, targeting a 1GHz operating frequency and worst case
PVT corner. The critical path of GRIP was determined to
be 0.93 ns, inside the weight SRAMs. Power estimates of
each unit was performed by generating activity factors from a
cycle accurate simulation of our implementation and applying
them to our synthesized design. We used Cacti v6.5 [39] to
estimate the area and power of the SRAM memories. We also
integrated Ramulator [26] into our simulator to estimate DRAM
timings and produce a command trace. These traces were fed
to DRAMPower [6] to estimate DRAM power.
VIII. EVALUATION
GRIP aims to accelerate GNN inference for a wide range of
models, specifically targeting low latency. We evaluate this by
measuring overall inference latency for four different models
and compare to a CPU and GPU baseline (Sec. VIII-A). To
better understand GRIP’s performance, we then breakdown
the contribution of each architectual feature (Sec. VIII-B)
and how the overall speedup changes as we modify both
architectural (Sec. VIII-C) and model parameters (Sec. VIII-D).
We also measure the impact of each GNN optimization we
implemented (Sec. VIII-E). Finally, we compare GRIP to
alternative approaches (Sec. VIII-F) and present a breakdown
of energy consumption during inference (Sec. VIII-G).
A. Overall Performance
To evaluate GRIP’s overall performance, we measured the
total end-to-end execution time (latency) to compute inference
with a variety of models and datasets. Table III shows GRIP’s
inference latency and speedup versus our CPU and GPU
implementation. We use 99th percentile latency for consistency
with prior evaluations of inference performance [38].
Performance vs. CPU. Compared to our CPU implementa-
tion, GRIP achieves a latency improvement of between 29.8×
(GCN, Pokec) and 10.9× (GIN, Pokec) with a geometric mean
of 17.0× across all datasets and models. GRIP tends to give a
smaller speedup on models that perform a larger portion of their
TABLE III: 99%-ile inference latency for GRIP, CPU, and
GPU
CPU GPU
Model Dataset GRIP µs × µs ×
GCN
youtube 15.4 309.2 (20.1) 1082.4 (70.5)
livejournal 15.8 466.8 (29.5) 1313.6 (83.1)
pokec 16.0 477.1 (29.8) 1085.6 (67.7)
reddit 16.3 407.1 (25.0) 813.2 (50.0)
G-GCN
youtube 134.1 2315.9 (17.3) 1332.5 ( 9.9)
livejournal 146.3 2493.2 (17.0) 1837.6 (12.6)
pokec 146.7 2637.9 (18.0) 1409.2 ( 9.6)
reddit 147.0 2864.2 (19.5) 1133.9 ( 7.7)
GS
youtube 113.7 1545.1 (13.6) 1309.0 (11.5)
livejournal 124.4 1947.4 (15.7) 2193.8 (17.6)
pokec 124.9 2075.7 (16.6) 1759.1 (14.1)
reddit 125.3 2099.0 (16.8) 1252.8 (10.0)
GIN
youtube 30.5 344.7 (11.3) 1387.6 (45.5)
livejournal 30.9 416.1 (13.5) 1221.5 (39.5)
pokec 31.1 340.7 (10.9) 855.5 (27.5)
reddit 31.4 354.8 (11.3) 1009.4 (32.2)
computation during the Update step of the message-passing
layer. For example, GIN’s Update uses a two-layer MLP that
requires roughly double the computation of GCN’s single
matrix multiplication. However, the additional computation
results in similar overall CPU inference latency since our
implementation is largely bottlenecked by non-computational
factors (Sec. II-B). This results in GRIP achieving a smaller
performance improvement of 10.9-13.5× compared to an
improvement of more than 13.6× for all other models.
Performance vs. GPU. Practical deployments of online
GNN inference most often use CPUs due to the large memory
requirements for graph features and low utilization at small
batch sizes. However, for completeness we also benchmark
GRIP against an Nvidia P100 GPU implementation for each
model. GRIP’s speedup on GPU ranges from 83.1× (Live-
journal, GCN) to 7.7× (Reddit, G-GCN) with a geometric
mean of 23.4×. For models with relatively low overall latency
(GCN, GIN) we see a significantly higher speedup than
with our CPU implementation. This is largely due to the
overhead of transferring embeddings from host to GPU memory
(roughly 200-500 µs, depending on the neighborhood size)
which comprises a large portion of the overall execution time
for models like GCN (25%-50% of total latency). GRIP does
not incur this penalty since features and weights are already
stored in device DRAM and do not have to be transferred
from the host. On models with a higher total execution time
(e.g. G-GCN), GRIP still achieves a significant speedup due
to low GPU utilization. With a batch size of 1, there is not
sufficient computation during each layer to fully utilize the
computational resources of the GPU and overhead of launching
each kernel tends to dominate.
8
0 5 10 15 20
Speedup
+ Other
+ Vert. Unit
+ Edg. Unit
+ Split Mem.
Baseline
19.5
19.2
10.3
3.0
1.0
(a) Speedup breakdown for each
component of GRIP versus baseline.
0 4 8 12 16 20
Speedup
GRIP
TPU+
HyGCN
Graphici.
Baseline
19.5
11.3
4.4
2.4
1.0
(b) Estimated speedups of
prior work versus baseline and
GRIP.
Fig. 9: Breakdown of performance improvements.
B. Breakdown of Performance
In this subsection, we breakdown the performance impact of
each architectural feature of GRIP. Specifically, we modify our
cycle-accurate simulator to match the bottlenecks exhibited by
our CPU implementation and then progressively remove each
modification to measure the impact of different units. As a
performance benchmark, we use the geometric mean speedup
of GCN for the largest neighborhood in each dataset.
Baseline Configuration. Our baseline configuration em-
ulates each core being assigned independent vertices and
performing all GReTA phases, with weights and partition
data being first loaded into L3 cache and intermediate values
accumulated directly in L2. This results in the following
simulator modifications. First, we modify our vertex unit to
use 14, 8× 2 matrix multiply units, with each unit assigned
independent vertices within a partition. This emulates the
effect of each CPU core using two 8-element SIMD units.
Second, we increase the number of fetch and gather units to
14 and the crossbar width to 32 bytes, matching the number of
cores and L2 cache bandwidth, respectively. We also disable
pipelining between the edge and vertex units to emulate a
single core performing both functions. Third, we merge the
weight and nodeflow buffers into a single SRAM and limit
the maximum read bandwidth to 16 bytes per cycle per fetch
unit, matching the bandwidth of the L3 cache. Finally, we
disable pipelining between the vertex and update unit to
model both operations being performed by the same core.
This configuration overestimates the performance of the CPU
in practice since it models ideal performance and no additional
computation required for auxiliary operations, such as indexing
calculations. In particular, with a 2.6GHz clock and an element
width of 4-bytes, our model is 2.07× faster than the measured
CPU latency.
Breakdown. In Fig. 9a, we show the impact of different
units in GRIP by progressively removing each modification
from our baseline in reverse order. First, we split the weight
and nodeflow memories into separate SRAMs. This results in
a 2.8× speedup due to removing contention between fetching
features and weights from the same SRAM (2.0×), as well
doubling the bandwidth available to load weight values into
the vertex unit (1.4×). Second, we add the edge unit, resulting
in an additional improvement of 3.4×. While this is partially
1 2 4 8 16
# DRAM Chan.
(a)
0.5
1
1.5
Sp
ee
du
p
32 128 512
Weight GiB/s
(b)
0.5
1
1.5
8 32 128
XBar Width
(c)
0.5
1
1.5
¼ ½ 1 2 4
Mat. TOP/s
(d)
0.5
1
1.5
Fig. 10: Impact of scaling architectural parameters. Dashed
vertical line indicates our implementation’s parameters. In
Fig. 10a the number of edge unit lanes is kept equal to the
number of channels.
due to increased crossbar bandwidth after adjusting the number
of fetch and gather units (1.14×), the majority of the speedup
is due to allowing loading data, edge-accumulate, and vertex-
accumulate phases to overlap by using a dedicated unit for
each phase (2.97×). Third, we enable the vertex unit and revert
to using a single 16 × 32 matrix multiply unit, resulting in
an additional 1.87× speedup. This is due to increased overall
TOP/s (1.63×) and using a single unit rather than multiple
units, which allows units to not be wasted when the overall
number of output vertices is small (1.15×). Finally, separating
and pipelining the update unit produces a small speedup of
1.02×.
C. Architectural Parameters
Here, we discuss the impact of several high level architectural
parameters on inference performance.
Number of DRAM Channels. The number of DRAM
channels determines the overall memory bandwidth available
to transfer data on- and off-chip. In Fig. 10a, we observe
that GRIP’s performance is strongly related to the number of
channels until around 8 channels (∼150GiB/s). This indicates
that GRIP’s performance is primarily limited by off-chip
memory bandwidth.
Weight Bandwidth. The weight bandwidth determines how
many values can be read from the global weight buffer each
cycle. If this is set too low, loading weight values can become
a bottleneck during vertex-accumulate. We observe this effect
in Fig. 10b below 128GiB/s, which corresponds to loading
64 weight values each cycle.
Crossbar Port Width. The crossbar port width determines
the number of elements accumulated by each gather unit in a
single cycle. In our experiments, the average number of edges
per vertex is fairly small (sampled to be less than 25). Since
edge-accumulate typically takes much less time than vertex-
accumulate or loading data from DRAM, increasing the width
has a limited impact on performance (Fig. 10c). However, it
is preferable to over-allocate the crossbar width in order to
ensure high performance even on dense nodeflows.
Matrix Multiply TOP/s. The total number of TOP/s GRIP
can achieve is determined primarily by the size of the matrix
multiply unit. In Fig. 10d we see that performance is strongly
related to the size of this unit, until reaching around 2TOP/s at
which point GRIP is limited by memory bandwidth. Thus, our
9
1 32 64 128 256 512 1K 2K 4K
Dimension Size
0
25
50
75
100
%
 T
ot
al
 T
im
e
M
at
M
ul
Input
Output
(a) Impact of feature dimensions.
1 2 4 8 16 32 64128
# Sampled Edges
0
25
50
75
100
%
 T
ot
al
 T
im
e
Ga
th
er
(b) Impact of sampling.
Fig. 11: The impact of scaling different GCN parameters on
the balance of time spent in each operation. Scaling the output
feature size increases the amount of time spent performing
matrix multiplication, while increasing the number of edges
decreases it.
implementation of GRIP would see a relatively small benefit
from a substantially larger matrix unit (1.14× for a 4× larger
unit).
D. Model Parameters
A key aspect GRIP’s design is balancing the performance
between GReTA’s edge and vertex-centric phases. Here, we
evaluate how this balance changes as the parameters of the
GNN model are altered.
Feature Dimensions. In Fig. 11a, we evaluate how varying
the number of the input and output features impacts the percent
of time spent in matrix multiplication. The proportion is initially
low (∼8%) for small features and increases linearly until
32 features. This is due to the fact that when the feature
size is smaller than the native width of the DRAM interface,
DRAM bandwidth is poorly utilized due to many random
accesses. In our implementation, we use two dual-channel
DRAM controllers, which each have an interface of 64 2-
byte elements. Above this point, the proportion of time spent
performing vertex-accumulation stays flat, reflecting the fact
that each additional feature results in a constant amount of
additional computation during inference. However, this analysis
does not hold for the output features, which can be increased
without needing to increase the number of values loaded from
DRAM. We see that increasing the output feature size always
increases the percent of time performing vertex-accumulate.
Thus, models with large output feature sizes are likely to be
limited by compute rather than memory.
Sampled Edges. Another important model parameter is
the number of sampled edges per output vertex. In Fig. 11b,
we evaluate how the number of edges impacts the percent
of total time spent performing edge-accumulate. For less
than 8 edges per vertex, GRIP’s performance is mostly
limited by computation and overhead related to accessing
data from DRAM. Above this threshold, the memory and
crossbar bandwidth becomes a bottleneck, and GRIP spends
an increasing portion of execution loading data.
Neighborhood Size. The neighborhood size heavily impacts
GRIP’s overall latency and is influenced by local graph
structure. To demonstrate the impact of the neighborhood
50 100 150 200 250
Neighborhood Size
1.0
1.2
1.4
1.6
1.8
No
rm
al
ize
d 
La
te
nc
y 99thmedian
min
(a)
50 100 150 200 250
Neighborhood Size
15
20
25
30
35
Sp
ee
du
p
(b)
Fig. 12: Impact of different neighborhood sizes on latency
for the GCN model. GRIP’s latency linearly increases with
neighborhood size due to more computation being required for
inference. The speedup is roughly constant until a neighborhood
size of about 95, at which point intermediate values no longer
fit into the cache of a single CPU core.
size on performance, we plot GRIP’s minimum, median, and
99th percentile inference latency for GCN across different
neighborhoods of the LiveJournal dataset in Fig. 12a. The
result is a strong linear relationship between the neighborhood
size and latency across the entire distribution. Each vertex added
to the neighborhood results in a roughly constant increase in
the amount of work during inference. Additionally, we observe
that as the neighborhood size increases, the median latency
moves closer to the 99th percentile. This is the result of larger
neighborhoods being more likely to be densely connected,
leading to a larger number of reductions that must be computed.
In Fig. 12, we evaluate the latency speedup compared to
the CPU baseline across different neighborhood sizes. Below
a neighborhood size of 95, we see a roughly constant speedup
of between 12× and 18×. For these neighborhood sizes, all
intermediate values fit into the L1 and L2 cache of a single
CPU core. After this point, some feature values must be stored
in the L3 cache and inference performance becomes limited
by the cache bandwidth (Sec. II-B).
E. Optimizations
0 1 2 3
Speedup
+ Weight
Preload
+ Pipeline
+ Cache
Baseline
2.5
1.7
1.3
1.0
(a) Impact of pipelining
1 4 8 12 16
# Vertices
32
64
128
256
Ti
le
 S
ize
0.6 2.1 3.1 5.2 4.9
1.1 3.8 5.4 8.0 7.3
1.1 3.7 5.3 8.0 7.3
1.1 3.7 5.2 7.4 6.6
(b) Speedup of vertex-tiling
Fig. 13: Impact of partitioning and tiling optimizations.
In this subsection, we evaluate the impact of each optimiza-
tion used by GRIP.
Partitioning and pipelining. In Fig. 13a we show the cu-
mulative speedups of each optimization enabled by partitioning.
10
We compare to an unoptimized baseline, where feature values
are loaded from off-chip on demand and no pipelining exists
between stages. First, by caching feature data on-chip, GRIP
achieves a 1.3× speedup. This is due to the decreased memory
traffic required to reload data between partition columns
and by the improved throughput from bulk loading data for
an entire partition. Second, pipelining operations between
different partitions results in an additional 1.3× speedup due
to overlapping execution with memory transfers. Finally, we
can also pipeline the transfer of weights from the global weight
buffer into the vertex-update unit. This increases the overall
speed-up to a total of 2.5×.
Vertex-Tiling. In Fig. 13b, we show the speedup compared
to no tiling as we alter the two tiling parameters M (the number
of vertices in a tile) and F (the number of elements per vertex).
We see that performance generally reaches a maximum around
F = 64 elements. Above F = 64, increasing F causes the
vertex unit to stall more often while waiting for a tile to be
produced by the edge unit. The performance degradation is
not linear because the time taken to accumulate a tile depends
on the connectedness of the nodeflow. We also see degraded
performance below F = 64. This is because F features
are loaded from memory for each vertex. As F decreases,
more random DRAM accesses are required to load features,
degrading DRAM throughput. Increasing M also increases
performance until around 12 vertices. The maximum number
of output vertices in our model is 11. Increasing M beyond
this only adds additional latency associated with processing
empty dummy vertices.
F. Comparisons to Prior Work
Several other approaches have been proposed to accelerate
neural networks and graph algorithms. Here, we analyze the
bottlenecks present in each approach and compare performance
with GRIP.
HyGCN. HyGCN [45] is an accelerator designed for graph
convolutional networks, a subset of GNNs that do not have
computation associated with edges. HyGCN and GRIP take
a similar approach of using separate units for edge- and
vertex-centric operations. However, GRIP addresses two major
bottlenecks present in the HyGCN design.
First, HyGCN uses 32 8-lane SIMD units to perform edge-
oriented operations, but can only issue a single edge at a time.
This means the throughput of edge operations will be limited
when the number of features is smaller than the total number
of SIMD lanes. In contrast, GRIP allows for multiple edges to
be issued in parallel.
Second, HyGCN requires an entire feature vector to be
computed and stored before performing vertex-oriented oper-
ations. In order to process multiple vertices in parallel, this
requires a large buffer to store accumulated values (16MB in
the HyGCN implementation). The size of this buffer is also
reported by the HyGCN authors to have a significant impact
on their overall performance (1.3 – 4× worse performance for
a 16× smaller buffer.) In contrast, GRIP uses vertex-tiling to
only store a small number of elements from multiple feature
vectors. This allows GRIP to use a roughly 10, 000× smaller
buffer (1.5KiB) while achieving comparable performance.
We demonstrate these limitations by modifying our simulator
to emulate the HyGCN approach. Specifically, we set the
number of gather and fetch units to 1 and the crossbar width
to 256 to match the number of SIMD lanes. We disable all
tiling and force feature vectors to be fully accumulated before
vertex-accumulate. We then set all other parameters to be the
same as GRIP, including the same partitioning used in our
evaluation of GRIP.
This configuration results in a speedup of 4.4× the baseline,
shown in Fig. 9b. However, it performs 4.5× slower than
GRIP due to limits in the available on-chip memory bandwidth
for weights. Incorporating vertex tiling would allow for a
much smaller edge accumulate buffer and reduce the required
bandwidth by increasing the reuse of the weights.
Modified TPU. The TPU [25] is a DNN accelerator designed
around a large 2-D systolic array. Unfortunately, GNNs are
difficult to implement efficiently for the TPU due to a lack of
support for edge-oriented operations [3]. Instead, we compare
GRIP to a modified version of the TPU architecture that
addresses this limitation by incorporating features from GRIP.
We refer to this modified design as the TPU+.
Specifically, the TPU+ has an additional unit similar to
GRIP’s edge-unit between the TPU’s unified buffer and the
systolic data setup. This allows the TPU+ to natively support
the GReTA programming model by mapping edge-accumulate
onto the new edge-unit, vertex-accumulate onto TPU’s systolic
array, and vertex-update onto the activation pipeline. This
design also supports both the execution partitioning and vertex-
tiling optimizations described in Sec. VI.
We estimate the performance of the TPU+ by modifying our
cycle-accurate model to use a single fetch and gather unit. We
also replace the vertex-unit with an identically sized 16× 32
systolic array. As in the original TPU design, weights are
stored off-chip and the dedicated weight bandwidth is limited
to 30GiB/s. All other parameters remain unchanged compared
to our evaluation of GRIP, including the use of 4× DDR4-2400
for off-chip memory and the same partitioning and vertex-tiling
optimizations.
This configuration achieves a 11.3× speedup (Fig. 9b)
compared to our baseline in Sec. VIII-B. The main bottleneck
in this approach is the limited bandwidth dedicated to weights.
Moving weights on-chip as in GRIP results in a 1.72× speedup.
Higher performance memory for weights (e.g. HBM as used by
later versions of the TPU) could also address this bottleneck.
However, we leave a fuller exploration for future work.
Graphicionado. Graphicionado [20] is an accelerator archi-
tecture designed for graph analytics. Like GRIP, Graphicionado
allows several units to be specialized for a particular algorithms,
such as GCN inference. However, it is designed for algorithms
that use a small amount of state per-vertex. As a result, it
suffers from two bottlenecks. First, like HyGCN, it cannot
perform vertex-tiling since it requires full feature vectors to be
accumulated. This results in a bottleneck similar to HyGCN
since weight data cannot be easily reused between different
11
vertices. Second, each lane has independent vertex units instead
of using a single shared unit, increasing the required weight
bandwidth by an amount proportional to the number of lanes.
We estimate the impact of these bottlenecks by modifying
our simulator by disabling tiling and splitting the vertex unit
into two units lanes that share a single tile buffer port. We
also use the same partitioning scheme used for GRIP. This
configuration results in a small speedup of 2.4× over the
baseline, shown in Fig. 9b. However, this is 8.1× slower than
GRIP due to significant bottlenecks in weight bandwidth.
G. Energy
Table IV shows the power consumption for each of GRIP’s
core top level modules during GCN inference. The single
most energy intensive during inference is loading embeddings
from DRAM, consuming more than the rest of the accelerator
combined (53.7%). This is due to the fact that both the number
of vertices and the feature size is the largest at the input of
GCN, leading to more data being initially loaded in the first
layer. Additionally, GRIP optimizes for latency with four high
performance DRAM channels, requiring a large amount of
energy per transfer. The rest of the energy is mostly used by
loading weights from the global weight and nodeflow buffers.
Both are fairly large, leading to a high energy cost per read and
write. In total, GRIP uses just 4.9W, a significant improvement
over the 135W TDP of the baseline CPU.
TABLE IV: Breakdown of power for GCN inference.
Module mW (%)
Execution
Units
Edge 4.1 0.1
Vertex 656.6 12.6
Update 0.4 < 0.1
SRAM Weight 1476.7 28.3Nodeflow 269.5 5.1
DRAM - 2794.7 53.7
Total 4932.4 100
IX. CONCLUSION
GNNs represent a promising new method in machine learning
to learn directly from graph-structured data. However, the
computational costs of GNNs represent a significant barrier for
deployment in many applications, especially in the scenario of
online inference.
This paper presents GRIP, an accelerator architecture de-
signed for low latency GNN inference. GRIP splits GNN
operations into a series of edge- and vertex-centric phases. Each
phase is implemented independently in hardware, allowing for
specialization of both the memory subsystem and execution
units to improve performance. Additionally, GRIP has hardware
support for several optimizations that further reduce latency,
including pipelining operations between nodeflow partitions
and vertex-tiling. We then implement GRIP as 28 nm ASIC
capable of executing a range of different GNNs. On a variety
of real graphs, our implementation improves 99th percentile
latency by a geometric mean of 17× and 23× compared to a
CPU and GPU baseline, respectively, while drawing only 5W.
REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mane´, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vie´gas, O. Vinyals, P. Warden,
M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow:
Large-scale machine learning on heterogeneous systems,” 2015, software
available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[2] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to
represent programs with graphs,” in 6th International Conference
on Learning Representations, ICLR, 2018. [Online]. Available:
https://openreview.net/forum?id=BJOFETxR-
[3] M. Balog, B. van Merrie¨nboer, S. Moitra, Y. Li, and D. Tarlow, “Fast
training of sparse graph neural networks on dense hardware,” arXiv
preprint arXiv:1906.11786, 2019.
[4] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zam-
baldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner
et al., “Relational inductive biases, deep learning, and graph networks,”
arXiv preprint arXiv:1806.01261, 2018.
[5] X. Bresson and T. Laurent, “Residual gated graph convnets,” arXiv
preprint arXiv:1711.07553, 2017.
[6] K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji,
B. Akesson, N. Wehn, and K. Goossens, “DRAMPower: Open-source
dram power & energy estimation tool,” URL: http://www.drampower.info,
vol. 22, 2012.
[7] J. Chen, T. Ma, and C. Xiao, “FastGCN: Fast learning with graph con-
volutional networks via importance sampling,” International Conference
on Learning Representations (ICLR), 2018.
[8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“DianNao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in Proceedings of the 19th International Conference
on Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’14. New York, NY, USA: ACM, 2014, pp. 269–
284. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541967
[9] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
2017.
[10] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning
supercomputer,” in Proceedings of the 47th Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO-47.
Washington, DC, USA: IEEE Computer Society, 2014, pp. 609–622.
[Online]. Available: http://dx.doi.org/10.1109/MICRO.2014.58
[11] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh,
“Cluster-GCN: An efficient algorithm for training deep and large graph
convolutional networks,” in Proceedings of the 25th ACM SIGKDD
International Conference on Knowledge Discovery; Data Mining, ser.
KDD ’19. New York, NY, USA: ACM, 2019, pp. 257–266. [Online].
Available: http://doi.acm.org/10.1145/3292500.3330925
[12] T. A. Davis and Y. Hu, “The University of Florida sparse matrix
collection,” ACM Transactions on Mathematical Software (TOMS),
vol. 38, no. 1, p. 1, 2011.
[13] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “ShiDianNao: Shifting vision processing closer to the
sensor,” in 2015 ACM/IEEE 42nd Annual International Symposium on
Computer Architecture (ISCA), June 2015, pp. 92–104.
[14] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor for
vision,” in 2011 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition Workshops (CVPR Workshops 2011). IEEE,
2011, pp. 109–116.
[15] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS:
Scalable and efficient neural network acceleration with 3d memory,”
in Proceedings of the Twenty-Second International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’17. New York, NY, USA: ACM, 2017, pp. 751–
764. [Online]. Available: http://doi.acm.org/10.1145/3037697.3037702
12
[16] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “TANGRAM:
Optimized coarse-grained dataflow for scalable NN accelerators,”
in Proceedings of the Twenty-Fourth International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’19. New York, NY, USA: ACM, 2019, pp. 807–
820. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304014
[17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl,
“Neural message passing for quantum chemistry,” in Proceedings of
the 34th International Conference on Machine Learning-Volume 70.
JMLR.org, 2017, pp. 1263–1272.
[18] N. Greeneltch and J. X, “Maximize TensorFlow performance on CPU:
Considerations and recommendations for inference workloads,” https:
//software.intel.com/en-us/articles/maximize-tensorflow-performance-
on-cpu-considerations-and-recommendations-for-inference, 2019.
[19] Gunrock Developers, “Hive workflow report for GraphSage GPU im-
plementation,” https://gunrock.github.io/docs/hive/hive graphSage.html,
accessed: 2020-02-20.
[20] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi,
“Graphicionado: A high-performance and energy-efficient accelerator
for graph analytics,” in 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–13.
[21] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning
on large graphs,” in Advances in Neural Information Processing Systems,
2017, pp. 1024–1034.
[22] Z. Huang, D. Zheng, Q. Gan, J. Zhou, and Z. Zhang, “Nodeflow
and sampling,” 2019, https://doc.dgl.ai/tutorials/models/5 giant graph/1
sampling mx.html#nodeflow, Accessed 2020-01-01.
[23] Intel Corporation, Intel Math Kernel Library. Reference Manual. Santa
Clara, USA: Intel Corporation, 2019.
[24] A. Jain, I. Liu, A. Sarda, and P. Molino, “Food discovery with Uber
Eats: Using graph learning to power recommendations,” https://eng.uber.
com/uber-eats-graph-learning/, accessed: 2020-02-20.
[25] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,
D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy,
J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin,
G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan,
R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter
performance analysis of a tensor processing unit,” in Proceedings of
the 44th Annual International Symposium on Computer Architecture,
ser. ISCA ’17. New York, NY, USA: ACM, 2017, pp. 1–12. [Online].
Available: http://doi.acm.org/10.1145/3079856.3080246
[26] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible dram
simulator,” IEEE Computer architecture letters, vol. 15, no. 1, pp. 45–49,
2016.
[27] K. Kiningham, P. Levis, and C. Re, “GReTA: Hardware Optimized Graph
Processing for GNNs,” in Proceedings of the Workshop on Resource-
Constrained Machine Learning (ReCoML 2020), March 2020.
[28] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks,” in International Conference on Learning
Representations (ICLR), 2017.
[29] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network
dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
[30] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A flexible
dataflow accelerator architecture for convolutional neural networks,” in
2017 IEEE International Symposium on High Performance Computer
Architecture (HPCA). IEEE, 2017, pp. 553–564.
[31] L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai,
“NeuGraph: Parallel deep neural network computation on large graphs,”
in 2019 USENIX Annual Technical Conference (USENIX ATC 19).
Renton, WA: USENIX Association, Jul. 2019, pp. 443–458. [Online].
Available: https://www.usenix.org/conference/atc19/presentation/ma
[32] Y. Ma, H. Ren, B. Khailany, H. Sikka, L. Luo, K. Natarajan, and B. Yu,
“High performance graph convolutional networks with applications in
testability analysis,” in Proceedings of the 56th Annual Design Automation
Conference 2019, 2019, pp. 1–6.
[33] D. Marcheggiani and I. Titov, “Encoding sentences with graph
convolutional networks for semantic role labeling,” in Proceedings
of the 2017 Conference on Empirical Methods in Natural Language
Processing. Copenhagen, Denmark: Association for Computational
Linguistics, September 2017, pp. 1507–1516. [Online]. Available:
https://www.aclweb.org/anthology/D17-1159
[34] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F.
Martı´nez, and C. Guestrin, “Graphgen: An FPGA framework for vertex-
centric graph computation,” in 2014 IEEE 22nd Annual International
Symposium on Field-Programmable Custom Computing Machines. IEEE,
2014, pp. 25–28.
[35] NVIDIA Corporation, cuSPARSE Library. NVIDIA Corporation, 2019.
[36] T. Oguntebi and K. Olukotun, “Graphops: A dataflow library for
graph analytics acceleration,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. ACM,
2016, pp. 111–117.
[37] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and
O. Ozturk, “Graph analytics accelerators for cognitive systems,” IEEE
Micro, vol. 37, no. 1, pp. 42–51, 2017.
[38] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J.
Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka,
C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S.
Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar,
D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius,
C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao,
F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada,
B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou, “MLPerf inference
benchmark,” arXiv preprint arXiv:1911.02549, 2019.
[39] P. Shivakumar and N. P. Jouppi, “Cacti 3.0: An integrated cache timing,
power, and area model,” Compaq Computer Corporation, Tech. Rep.,
2001.
[40] P. Velicˇkovic´, G. Cucurull, A. Casanova, A. Romero, P. Lio,
and Y. Bengio, “Graph attention networks,” in International
Conference on Learning Representations, 2018. [Online]. Available:
https://openreview.net/forum?id=rJXMpikCZ
[41] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha,
A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and
A. Raghunathan, “ScaleDeep: A scalable compute architecture for
learning and evaluating deep networks,” in Proceedings of the 44th
Annual International Symposium on Computer Architecture, ser. ISCA
’17. New York, NY, USA: ACM, 2017, pp. 13–26. [Online]. Available:
http://doi.acm.org/10.1145/3079856.3080244
[42] Y. Wang, Y. Pan, A. Davidson, Y. Wu, C. Yang, L. Wang, M. Osama,
C. Yuan, W. Liu, A. T. Riffel et al., “Gunrock: GPU graph analytics,”
ACM Transactions on Parallel Computing (TOPC), vol. 4, no. 1, pp.
1–49, 2017.
[43] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehen-
sive survey on graph neural networks,” arXiv preprint arXiv:1901.00596,
2019.
[44] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are
graph neural networks?” in International Conference on Learning
Representations, 2019. [Online]. Available: https://openreview.net/forum?
id=ryGs6iA5Km
[45] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and
Y. Xie, “HyGCN: A GCN accelerator with hybrid architecture.” IEEE,
2020.
[46] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and
J. Leskovec, “Graph convolutional neural networks for web-scale
recommender systems,” in Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining.
ACM, 2018, pp. 974–983.
[47] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec,
“Hierarchical graph representation learning with differentiable pooling,”
in Advances in neural information processing systems, 2018, pp. 4800–
4810.
[48] H. Zeng and V. Prasanna, “GraphACT: Accelerating GCN training
on CPU-FPGA heterogeneous platforms,” in The 2020 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, 2020, pp.
255–265.
[49] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
13
[50] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,”
in Advances in Neural Information Processing Systems, 2018, pp. 5165–
5175.
14
