Rubik: A Hierarchical Architecture for Efficient Graph Learning by Chen, Xiaobing et al.
Rubik: A Hierarchical Architecture
for Efficient Graph Learning
Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Member, IEEE, Abanti Basak, Ling Liang, Mingyu Yan, Lei
Deng, Member, IEEE, Yufei Ding, Zidong Du, Yunji Chen, Yuan Xie, Fellow, IEEE
Abstract—Graph convolutional network (GCN) emerges as
a promising direction to learn the inductive representation in
graph data commonly used in widespread applications, such
as E-commerce, social networks, and knowledge graphs. How-
ever, learning from graphs is non-trivial because of its mixed
computation model involving both graph analytics and neural
network computing. To this end, we decompose the GCN learning
into two hierarchical paradigms: graph-level and node-level
computing. Such a hierarchical paradigm facilitates the software
and hardware accelerations for GCN learning.
We propose a lightweight graph reordering methodology,
incorporated with a GCN accelerator architecture that equips
a customized cache design to fully utilize the graph-level data
reuse. We also propose a mapping methodology aware of data
reuse and task-level parallelism to handle various graph inputs
effectively. Results show that the Rubik accelerator design improves
energy efficiency by 26.3x to 1375.2x over GPU platforms across
different datasets and GCN models.
Keywords: Deep Learning Accelerator; Graph Neural Net-
work;
I. INTRODUCTION
With rich and expressive data representation, graphs demonstrate
their applicability in various domains, such as E-commerce [1]–[3],
computer vision [4], [5], and molecular structures [6]. To fully
exploit their value, approaches based on traditional graph analytic
algorithms (e.g., BFS, SSSP) facilitate in-depth understanding of object-wise
relationships in graphs (e.g., molecule similarity in chem-
istry [6], cells structures in bioinformatics [7], [8], and seman-
tic graphs in computer vision [4], [5]). Recently, as a rising
star, extending deep learning techniques to graph analytics
has gained lots of attention from both research [9]–[12] and
industry [13], [14], largely because of their striking success
on Euclidean data (e.g., images, videos, text, and speech).
Moreover, such geometric deep learning techniques based on
graph neural networks (GNNs) not only learn the inductive
representations for end-to-end tasks such as classification,
clustering, and recommendation, but also show much better
accuracy (more than 98%) than traditional methods [1] (e.g.,
random walks [15] and matrix factorization [16]).
Xiaobing Chen is with the State Key Laboratory of Computer Architecture,
Institute of Computing Technology, Chinese Academy of Sciences,
and also with the University of Chinese Academy of Sciences, Beijing
100190, China. Xing Hu, Mingyu Yan, Zidong Du, and Yunji
Chen are with the State Key Laboratory of Computer Architecture,
Institute of Computing Technology, Chinese Academy of Sciences, Beijing
100190, China (email: chenxiaobing@ict.ac.cn, huxing@ict.ac.cn,
yanmingyu@ict.ac.cn, duzidong@ict.ac.cn, cyj@ict.ac.cn). Yuke Wang and Yufei
Ding are with the Department of Computer Science, University of California,
Santa Barbara, USA (email: yuke_wang@ucsb.edu, yufeiding@cs.ucsb.edu).
Xinfeng Xie, Abanti Basak, Lei Deng, Ling Liang, and Yuan Xie are with the
Department of Electrical and Computer Engineering, University of California,
Santa Barbara, USA (email: xinfeng@ucsb.edu, abasak@umail.ucsb.edu,
lingliang@ucsb.edu, leideng@ucsb.edu, and yuanxie@ucsb.edu). Xing Hu is
the corresponding author.
Among various kinds of GNNs [9], [17], graph convolu-
tional network (GCN) is the most fundamental model and has
been widely studied. Different GCNs can be summarized and
abstracted into a uniform computing model with two stages:
Aggregate and Update. Aggregate stage collects the localized
node representations from the neighboring nodes. Update stage
derives the representation vector with the aggregation results.
GCN distinguishes itself by combining both neural
network computing and graph computing schemes in these two
stages, and thus suffers from the following challenges.
Entangled hybrid paradigms raise the difficulty of ef-
ficient computing in the uniform hardware architecture.
Specifically, aggregate operation is largely based on non-
Euclidean graph-level data, which is non-ordered with a di-
verse range of node sizes and topology. Due to the irregular
memory accesses, graph-level non-Euclidean data cannot be
easily handled by NN accelerators good at spatial data reuse
with the statically configured vertical, horizontal, or diagonal
dataflow [18], [19]. On the other hand, update computing
consists of regular vector and matrix computations, which are
hungry for computing resources. For example, GCNs feature high-dimension
node/edge embeddings (10x-1000x larger [9], [20] than
those of traditional graph computing) with complex NN operations
(e.g., Multilayer Perceptron), while traditional graph
computing works with simple arithmetic operations (e.g., addition)
on nodes with scalar values. Such a computing paradigm
can hardly be handled by graph accelerators with resource-intensive
on-chip caches for suppressing irregular accesses [21].
Thus, existing NN accelerator and graph accelerator designs
pale in their effectiveness for handling GCN computing.
Workload diversity and graph irregularity raise the
difficulty for efficient task mapping to utilize the hardware
capability adaptively. When the input graph has a larger fea-
ture dimension, the GCN computing shifts to NN computing
and demands more multiply-and-accumulate (MAC) arrays
for powerful computation capability. However, when the input
graph has a large number of nodes with high average degrees,
the GCN computing shifts closer to graph computing that de-
mands large on-chip buffer and data management methodology
to eliminate the irregular memory access. Hence, it is essential
to design an efficient task mapping methodology to bridge the
gap between the diverse application demands and the uniform
hardware platform.
arXiv:2009.12495v1 [cs.AR] 26 Sep 2020
To this end, we decouple the entangled graph-level computing
and node-level computing, which facilitates the software
and hardware optimizations for graph learning. Such
decoupled hierarchical computing paradigm is based on the
following observations: 1) GCN learns both the graph-level
and node-level features; 2) graph-level computing and node-
level computing exhibit distinct architectural characteristics.
Specifically, node-level computing refers to intra-node com-
puting during feature extraction and update on node-level
Euclidean data with neural network techniques. Graph-level
computing refers to the process of graph traversal for localized
neighboring reduction (feature reduction in Aggregate) on
selected graph-level non-Euclidean data.
We then propose scheduling and mapping strategies to
tackle the irregular memory access issue of graph-level computing, and
a hardware architecture design to optimize node-level computing. In detail,
we carry out a lightweight graph reordering on the input
graphs to expose more graph-level data reuse. Then, we
propose the programming model and tailor the neural network
accelerator that incorporates a hierarchical spatial architecture
with specialized cache design, to leverage the input graphs’
data locality. To bridge the gap between the diverse graph
applications and uniform architectures, we propose a hierar-
chical mapping methodology to improve both the data reuse
and task-level parallelism.
Overall, we make the following contributions in this work:
• We decouple GCN computing into two paradigms: 1) the
relatively fixed and regular node-level computing, and 2)
the dynamic and irregular graph-level computing. Such
a decoupled computing paradigm facilitates the software
and hardware optimization for GCN applications.
• We propose a lightweight graph reordering method to
facilitate graph-level data reuse and intermediate computation
results reuse. Furthermore, we design a GCN
training accelerator, Rubik, which cooperates with graph reordering
to support the hybrid paradigms of both node-level
computing and graph-level computing.
• We propose the hierarchical task mapping strategies for
graph-level computing and node-level computing, which
comprehensively optimize both data reuse and task-level
parallelism to well adapt diverse datasets with different
feature sizes and graph topologies to the hardware plat-
form.
• Extensive experiments and studies show that the graph
reordering and hierarchical mapping eliminate 69% and
58% of the off-chip memory accesses for GraphSage and
GIN, respectively. Rubik outperforms GPUs by 26.3x to 1375.2x in
energy efficiency.
II. BACKGROUND
In this section, we introduce the GCN basics, the abstract
computing model, and the variants derived from GCNs.
A. GCN Basics
The target of graph convolutional neural networks is to
learn the state embedding of a graph property (node, edge,
or subgraph) from the non-Euclidean input graph structure.
Such state embeddings transform the graph features to low-dimension
vectors, which are used for node and edge classification
[22]–[24], graph clustering [8], and link prediction [25]–[27].
In the scope of node classification tasks, we define a
graph, G = (V,E), where V and E are vertex and edge
sets, respectively; each node has node feature vectors Xv for
v ∈ V ; and each edge has edge feature vectors Xe for e ∈ E.
On such a graph, GCNs learn a representation vector of a
node (hv), an edge (he), or the entire graph (hG) with the
information of the graph structure and node/edge features, so
that the corresponding classification tasks can be completed
based on the representations.
In terms of the computing paradigm, GCNs fall into two main
categories: spectral GCNs [28]–[30] and spatial GCNs [9], [11],
[12], [17], [31]. The former is derived from graph signal
processing, and its mathematical representation is based on
eigen-decomposition of the graph Laplacian matrix [10]. However,
spectral GCNs fall short in several aspects: 1) the inability
to perform inductive learning, because the Laplacian
decomposition is fixed to a specific graph; 2) the inefficiency
in handling large graphs, since it demands the decomposition of
the entire graph adjacency matrix. On the other hand, spatial
GCNs learn the inductive representation based on
the graph computing paradigm, which identifies the spatial
aggregation relationships of nodes/edges. Therefore, spatial
GCNs are capable of generating embeddings for unseen nodes,
edges, or subgraphs. Moreover, spatial GCNs can process large
graphs without compromising performance. In addition, previous
works and in-depth studies [9], [13], [17] also demonstrate
spatial GCN as a promising direction. Given this potential,
we concentrate on spatial
GCN for further exploration in this work.
B. GCN Computing Model
The GCN training process consists of the following three
stages: forward propagation, loss calculation, and backpropagation.
The forward propagation calculates the feature of each node by
iteratively incorporating the impact of its neighbors, and
finally outputs the state of each node, which is compared against the
ground truth for loss computation. The backpropagation finds
the impact of each state on the loss by propagating from the
last layer to the input layer based on the chain rule of the
gradient. It is similar to the forward propagation but runs in the
reverse direction.
Algorithm 1: GCN Algorithm.
Inputs: Graph $(V, E)$; input features $\{X_v, \forall v \in V\}$; depth $K$;
weight matrices $\{W^k, \forall k \in \{1, \dots, K\}\}$; aggregator functions
$\{\mathrm{AGGREGATE}^{(k)}, \forall k \in \{1, \dots, K\}\}$; neighborhood function
$N : v \to 2^V$
Output: Vector representation $z_v$ for all $v \in V$
$h_v^{(0)} = X_v$
for $k = 1 \dots K$ do
    for $v \in V$ do
        $a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}(\{h_u^{(k-1)} \mid u \in N(v)\})$
        $h_v^{(k)} = \mathrm{UPDATE}^{(k)}(h_v^{(k-1)}, a_v^{(k)})$
    end
end
$z_v = h_v^{(K)}$
We detail the process of the forward propagation by taking
node classification as an example. The forward propagation
stage of modern GCNs works in an iterative manner, as shown
by the for-loop in Algorithm 1. Assume a node $v$ in
graph $(V, E)$ with embedding $h_v^{(0)}$ initialized as $X_v$, and let
$N(v)$ denote the set of $v$'s neighbors. $a_v^{(k)}$ and $h_v^{(k)}$ are
the aggregation result and the node embedding of $v$ after
the completion of the $k$-th layer of a GCN. The computation
process of GCN repeats the following two steps: 1) Aggregate
the node representation from the neighboring nodes; 2) Update
the representation vector based on the aggregation results
and its previous state (some works use the term
“Combine” instead of Update [32]). The forward propagation
process is illustrated in Figure 1, which shows the case with
two iterations. The backward propagation process is similar to
the process shown in Figure 1: the gradient of
$a_v^{(k+1)}$ is aggregated when computing the gradient of $h_v^{(k)}$.
 
Fig. 1. GCN forward propagation flow with two iterations.
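The Aggregate/Update iteration of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the Rubik implementation: the sum aggregator, the concatenate-then-ReLU update, and the names `gcn_forward`, `neighbors`, and `weights` are assumptions chosen for brevity.

```python
import numpy as np

def gcn_forward(neighbors, X, weights):
    """K-layer forward pass with the Aggregate/Update structure of Algorithm 1.

    neighbors: dict mapping node id -> list of neighbor ids (the function N);
               node ids are assumed to be 0..num_nodes-1
    X:         array of shape (num_nodes, d_0), the input features X_v
    weights:   list of K matrices; the k-th has shape (2 * d_k, d_{k+1})
    """
    h = X
    for W in weights:
        h_next = np.zeros((h.shape[0], W.shape[1]))
        for v, nbrs in neighbors.items():
            # Aggregate (assumed: sum) over the previous-layer neighbor embeddings.
            a_v = h[nbrs].sum(axis=0) if nbrs else np.zeros(h.shape[1])
            # Update (assumed form): ReLU applied to [h_v ; a_v] times W.
            h_next[v] = np.maximum(0.0, np.concatenate([h[v], a_v]) @ W)
        h = h_next
    return h  # row v holds z_v

# A three-node path graph 0 - 1 - 2 with one-hot input features.
path = {0: [1], 1: [0, 2], 2: [1]}
z = gcn_forward(path, np.eye(3), [np.ones((6, 2))])
```

The inner loop over `v` corresponds to graph-level computing (irregular gathers of neighbor rows), while the matrix product inside the update corresponds to node-level computing.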
Many variants of the functions $\mathrm{AGGREGATE}^{(k)}(\cdot)$ and
$\mathrm{UPDATE}^{(k)}(\cdot)$ have been proposed to improve the prediction
accuracy or to reduce the computation complexity of
GCNs. For example, convolutional aggregators are used in
graph convolutional neural networks, and attention aggregators are
used in graph attention neural networks [33]. Gate updaters are
adopted in gated graph neural networks or graph LSTM [34].
Although there are many variants of GCN models [7], they can
be abstracted into the uniform computing model discussed in
Section II-B. Hence, without loss of generality, we focus on
graph convolutional neural networks in this work.
III. CHARACTERIZATION IN GCNS
A. Hybrid Computing Paradigms in GCN
The GCN forward propagation process entangles two
computing paradigms: 1) graph-level computing, which
traverses nodes and aggregates the node representations from
the neighborhood in the aggregation stage; and 2) node-level
computing, which extracts or updates features based
on deep neural network techniques. These graph-level and
node-level computing paradigms demand different hardware
resources. For example, neural network computing on node-level
Euclidean data introduces heavy vector and matrix computation
but regular memory accesses, so dataflow optimizations
can easily increase data reuse and eliminate off-chip
memory accesses [18]. In contrast, graph-level computing is
mainly memory-bound because of the irregular accesses in
a non-Euclidean graph structure, which can hardly be handled
by the data reuse strategies of neural network computing.
Hence, computation and memory demands vary for different
input datasets with diverse graph topology and node feature
dimensions.
[Figure 2: bar charts of latency for the NN accelerator (NN-Acc) and the graph accelerator (Graph-Acc). One panel plots latency against feature size (16, 64, 128, 256) for COLLAB, REDDIT, CITESEER, BZR, IMDB-BIN, and DD; the other plots normalized latency per dataset (COLLAB, BZR, IMDB, DD, REDDIT, CITESEER-S).]
Fig. 2. (a) Performance comparison with diverse applications. (b) Performance
comparison with different feature sizes.
We further quantitatively evaluate the GCN performance of
diverse input graphs with different feature sizes and degree
distributions on two platforms: NN accelerator (NN-Acc) and
Graph-like accelerator (Graph-Acc). 1) NN-Acc: We imple-
ment an NN accelerator similar to Eyeriss [18], which has
larger MAC arrays in every PE and has no private cache
for graph traversal data buffering. The dataflow is similar to
Eyeriss, which enables MACs to support efficient data reuse.
The detailed configuration of the NN accelerator is shown in
Table II. 2) Graph-Acc: We tailor the graph accelerator to
execute the graph convolutional neural networks. The Graph
accelerator closely resembles a prior Graph accelerator [21],
which is equipped with a large on-chip buffer and the pro-
cessing array to deal with the matrix-vector multiplication.
The detailed configuration is shown in Table II. We evaluate
six GCN datasets on GIN (detailed configurations are in
Section V-A) and the results are shown in Figure 2. We have
the following observations:
1) Computing on input graphs with lower degree shifts to the NN
computing mode and favors more computing resources. For
example, BZR, DD, and Citeseer-S have average degrees of
1.1, 2.5, and 3.6, respectively, and the NN accelerator performs
better than the graph accelerator on them.
2) Optimization for non-Euclidean graph-level data reuse
plays a much more important role when training on input graphs
with a larger average degree. For example, COLLAB, IMDB-BINARY,
and REDDIT have average degrees of 32.8, 4.8,
and 492, respectively, and the graph accelerator performs better than
the NN accelerator on them.
3) The NN accelerator is extremely under-utilized because of
its memory inefficiency for most of the GCN models. Taking
REDDIT in Figure 2(b) as an example, the execution
latency of the NN accelerator stays flat even when the output
dimension scales from 16 to 256, which indicates that the
computation capability is under-utilized and that the NN accelerator is
heavily memory-bound, largely because of the graph
irregularity.
In summary, GCNs favor NN-Acc with powerful computa-
tion capabilities and optimizations for spatial data reuse when
the input graph has a high feature dimension, while GCNs
appreciate Graph-Acc with larger on-chip memory when input
graphs exhibit high irregularity and complex topologies. Thus,
there are two important issues to be addressed for design-
ing GCN acceleration architectures: 1) how to optimize the
memory access efficiency of graph-level data; and 2) how
to design efficient and feasible architectures for input graphs
with diverse graph scales and feature dimension sizes when
algorithms constantly evolve.
B. Opportunities in Graph-level Data Reuse
We observe that there are two different types of data reuse
opportunities in GCNs: node-level (Euclidean) and graph-level
(non-Euclidean) data reuse. Taking the illustrative case in
Figure 3(b) as an example, during the update stage, feature
vectors of node6 are fed in the neural networks as input.
Such neural network computing for node-level data has been
well studied in the previous work [18]. Thus the spatial
architectures that exploit high compute parallelism using direct
communication between processing elements (PEs) can be
used to optimize the data reuse in either vertical, horizontal,
or diagonal directions [18].
During the graph feature computation in the aggregation
reduction stage, the irregular memory access cannot be efficiently
handled by the Euclidean dataflow methodologies that
exploit high spatial locality through direct communication
between processing elements in either vertical, horizontal,
or diagonal directions. However, real-world graphs have an intrinsic
“community” structure, in which some nodes share neighbors or have
denser connections to a group of nodes. This structure offers two types of
graph-level data reuse opportunities: graph-level feature data
reuse (G-D) and graph-level computation results reuse (G-C).
1) Graph-level feature data reuse: The node feature data
can be potentially reused during graph traversal in the aggre-
gation computation. As shown in Figure 3, when computing
neighbor aggregation for node6, feature data of node4, node5,
and node8 will be accessed. When computing neighbor aggre-
gation for node2, node4 and node5 will be accessed. Hence,
the feature data of node4 and node5 will be repetitively reused
if we traverse the graph for aggregate computing with the
order of node2 and node6. Such data reuse of node feature
data during graph traversal is referred to as graph-level feature
data reuse. The reuse distance is determined by the graph
topologies and traversal order.
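To make the dependence of reuse distance on traversal order concrete, the following sketch measures it for the node2/node6 example. It is illustrative only: the helper name `reuse_distances`, the choice of "number of distinct other nodes touched in between" as the distance metric, and the extra node 9 with neighbors {1, 3, 7} are assumptions that do not come from Figure 3.

```python
def reuse_distances(order, neighbors):
    """For every repeated feature access, count the distinct other nodes
    accessed since the previous access to the same node."""
    # Flatten the traversal into the sequence of neighbor-feature accesses.
    trace = [u for v in order for u in neighbors[v]]
    last_seen = {}
    distances = []
    for i, u in enumerate(trace):
        if u in last_seen:
            distances.append(len(set(trace[last_seen[u] + 1:i])))
        last_seen[u] = i
    return distances

# node2 and node6 follow the text; node 9 is a hypothetical unrelated node.
neighbors = {2: [4, 5], 6: [4, 5, 8], 9: [1, 3, 7]}

adjacent = reuse_distances([2, 6, 9], neighbors)   # node2 and node6 back to back
separated = reuse_distances([2, 9, 6], neighbors)  # node 9 scheduled in between
```

Here `adjacent` evaluates to `[1, 1]` and `separated` to `[4, 4]`: scheduling node2 and node6 consecutively keeps the features of node4 and node5 within a much shorter reuse window.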
2) Graph-level aggregation computation reuse: The intermediate
aggregation results can be potentially reused because
of the shared neighbor sets in the “community” structure of
graphs and the order-invariant property of aggregation operators.
The aggregation reduction operations are commonly based on
sum, average, or min/max, so the computing order does not affect
the final result. Hence the intermediate computation results of
shared neighbor sets can be reused. For example, node2
and node6 share the neighbor set {node4, node5}.
The intermediate result of aggregating node4 and node5 can
be reused when computing node2 and node6, as illustrated
in Figure 3(b). The benefits of computation reuse are
twofold: 1) eliminating the redundant computation
of feature vectors; 2) alleviating the memory burden and
data thrashing caused by the redundant computation of node feature
vectors.
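The saving from computation reuse can be illustrated with a small sum-aggregation sketch for node2 and node6. The function name `aggregate_with_reuse`, the feature values, and the convention of counting one vector addition per combined pair are assumptions made for this example; every target is assumed to contain the whole shared set among its neighbors.

```python
import numpy as np

def aggregate_with_reuse(targets, neighbors, features, shared):
    """Sum-aggregate each target, computing the partial sum over the
    shared neighbor set once and reusing it for every target."""
    shared = frozenset(shared)
    partial = sum(features[u] for u in sorted(shared))  # computed once
    adds = len(shared) - 1                               # vector additions so far
    results = {}
    for v in targets:
        acc = partial.copy()
        for u in neighbors[v]:
            if u not in shared:          # only the non-shared neighbors remain
                acc = acc + features[u]
                adds += 1
        results[v] = acc
    return results, adds

neighbors = {2: [4, 5], 6: [4, 5, 8]}
features = {4: np.array([1.0, 0.0]),     # illustrative feature vectors
            5: np.array([0.0, 1.0]),
            8: np.array([2.0, 2.0])}

results, adds = aggregate_with_reuse([2, 6], neighbors, features, shared={4, 5})
```

With reuse, the two aggregations need 2 vector additions in total; aggregating node2 and node6 independently would need 3 (one for node2 and two for node6), and the gap grows with the size of the shared set and the number of nodes sharing it.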
In summary, a significant amount of graph-level data locality
is hidden in the irregular graph traversal. With the limited
on-chip memory resources, graph scheduling strategies are
important to reduce the data reuse distance for more efficient
memory accesses.
[Figure 3: (a) an example input graph; (b) aggregation (Aggr) and Update for V2 and V6 over the neighbors V4, V5, and V8, annotated with graph-level data reuse, graph-level computation reuse, and node-level data reuse.]
Fig. 3. (a) An example of input graph; (b) Data reuse schemes: graph-level
data reuse, graph-level intermediate computation reuse, and node-level data
reuse.
[Figure 4: overview diagram with three parts. Input graph scheduling: execution order and access footprint before and after reordering. Hierarchical task mapping: graph-level mapping of nodes to PEs (PE0: V2, V4, V6, V5; PE1: V8, V3, V1, V7) and node-level mapping of tiled weight data and feature data onto the MAC array. Rubik accelerator: memory controllers, scheduler and mapper, global buffer, and PE array; each Rubik processing element contains an instruction queue, Ld/St queue, NoC queue, G-D cache, G-C cache, and a MAC array of RF/ALU units.]
Fig. 4. Design overview of Rubik: (1) Input graph reordering that groups
the nodes with more shared neighbors together to reduce reuse distance; (2)
Hierarchical task mapping; (3) The Rubik architecture.
IV. RUBIK DESIGN
Based on the observed challenges and opportunities of
GCN applications, we propose Rubik to fully utilize both
graph-level and node-level data locality and computation parallelism.
The key design concept is to decouple the entangled
non-Euclidean computing and Euclidean computing, and to propose
a software-based methodology to optimize the former and a hardware
architecture design to optimize the latter.
[Figure 5: (a) adjacency rows in index order with their memory access footprint; (b) rows after LSH reordering (step 1), clustered as R-02,04; R-05,06; R-03,01,07,08; (c) shared node set exploration (step 2), listing vertex sets and their reusable aggregations: V3, V8 with Aggr(V1, V7); V2, V6 with Aggr(V4, V5); V1, V7 with Aggr(V8, V3); V4, V5 with Aggr(V2, V6). The legend distinguishes feature vectors for Update from feature vectors for Aggregate.]
Fig. 5. Input graph reordering: (a) Index order before ordering; (b) LSH-based row reordering; (c) Shared-Set Exploration.
As shown in Figure 4, Rubik mainly consists of three
parts: 1) the input scheduling methodology that utilizes graph
reordering to determine the traversal order for smaller reuse
distance of graph-level feature data and computation results; 2)
the mapping methodology to allocate the tasks to processing
elements for both computation parallelism and data reuse; 3)
the hardware architecture design that cooperates with the scheduling
and mapping methodology to leverage both the global-level
and node-level data locality. We enhance neural network
accelerators in a lightweight way, so that minimum effort is
needed to tailor the neural network accelerator for efficient
GNN computing.
A. Input Graph Ordering
In this section, we introduce a lightweight graph reordering
methodology which improves the graph-level data locality. In
this work, the graph reordering happens only once, at the pre-processing
stage. Such ordering can be integrated in
the graph pre-processing in GNN algorithms that adopt the
graph topology features for more efficient training [35]. More
discussion about the overhead and feasibility is in Section VI.
The goal of reordering is to group the nodes with more
shared neighbors together to improve the graph-level data
reuse when conducting aggregation reduction operations. The
intrinsic reason that the reordering method can provide better
temporal reuse is based on the fact that real-world graphs
exhibit a “community structure” [36], which means that some
nodes share neighbors or have a closer relationship to each
other. Therefore, by grouping them together, the data locality
during execution will be significantly improved. Note that
graph reordering does not change the graph structure but only
affects the execution order of the graph. We develop the graph
reordering algorithm by combining Locality-Sensitive Hashing
and Row-Column Ordering.
1) LSH-based Graph Reordering
Locality-Sensitive Hashing (LSH) is an algorithmic technique
widely used to solve the approximate or exact
nearest neighbor problem in high-dimensional space [37]–[39].
It groups similar items into the same “buckets” with high
probability. The basic concept is based on random projection:
for every input vector x, the hash function is calculated by
projecting x onto several random vectors. With a
series of random vectors, LSH maps an input vector to a bit
vector (bucket). Input vectors with smaller distances have a
higher probability of ending up in the same cluster with the same
bit vector.
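A minimal sketch of random-projection LSH is given below, assuming the common sign-of-projection variant in which each random vector contributes one bit. The name `lsh_signature`, the number of projections (8), and the vector length (16) are illustrative assumptions; the hash configuration used by Rubik is not specified in this section.

```python
import numpy as np

def lsh_signature(x, planes):
    """Map vector x to a bit tuple: one bit per random projection vector."""
    # Each bit records on which side of a random hyperplane x falls.
    return tuple(int(b) for b in (planes @ x > 0))

rng = np.random.default_rng(0)
planes = rng.standard_normal((8, 16))   # 8 random vectors of dimension 16

x = rng.standard_normal(16)
signature = lsh_signature(x, planes)    # the "bucket" of x
```

Vectors separated by a small angle fall on the same side of most random hyperplanes, so they agree on most bits and are likely to land in the same bucket; using more projections yields smaller, more selective buckets.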
Reordering Flow: We leverage the LSH technique to cluster
the nodes with more shared neighbors. Every row in the
adjacency matrix of the graph is a vector that represents the
neighbor connections of one vertex. Taking these row vectors
of the adjacency matrix as input, LSH groups the rows
into several clusters. Taking the input graph in Figure 3(a) as
an example, the processing flow is illustrated in Figure 5. Row-02
and Row-04 are grouped into one cluster because they share
most of their neighbors and have similar row vectors. Similarly,
Row-05 and Row-06 are grouped into a cluster, and Row-03, Row-01,
Row-07, and Row-08 are grouped into a cluster. Thus, after the
row transformation in step 1, we have the transformed graph
with the nodes assigned to the same bucket placed
contiguously, as illustrated in Figure 5(b). In this way, the
reuse distance of the node feature vectors is reduced.
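The reordering flow can be sketched as follows: hash every adjacency row, then emit the rows bucket by bucket. This is a simplified illustration under assumed parameters (`num_bits`, `seed`, and the function name `lsh_reorder` are hypothetical), and the example adjacency matrix is made up rather than taken from Figure 3(a).

```python
import numpy as np
from collections import defaultdict

def lsh_reorder(adj, num_bits=4, seed=0):
    """Return an execution order in which rows with the same LSH
    signature are placed next to each other."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, adj.shape[1]))
    bits = adj @ planes.T > 0                 # one signature (row of bits) per node
    buckets = defaultdict(list)
    for row, sig in enumerate(bits):
        buckets[tuple(sig.tolist())].append(row)
    # Concatenate the buckets; only the execution order changes, not the graph.
    return [row for sig in sorted(buckets) for row in buckets[sig]]

# Hypothetical 4-node adjacency matrix: rows 0 and 2 are identical, as are rows 1 and 3.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
order = lsh_reorder(adj)
```

Rows with identical neighbor lists always receive the same signature and therefore share a bucket; rows with merely similar neighbor lists do so with high probability, which is what shortens the reuse distance after reordering.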
2) Shared Node Set Exploration
Based on the reordered graph, we explore the reuse potential
of the intermediate aggregation computation results.
The basic idea is to find the shared node sets in the window
of neighboring rows. A simple example is illustrated in Figure 5(c),
where V2 (node2) and V6 share the neighbor set of V4
and V5. Therefore, the intermediate results of V4 and V5 can be
reused for V2 and V6 aggregation computation. Similarly, the
intermediate computation result of V1 and V7 can be reused
for V3 and V8.
Because it is too time-consuming to find the shared node
sets that maximize the potential computation result reuse, we
adopt a heuristic that limits the search window
instead, i.e., finds the shared node set only among adjacent nodes
in the execution order, for instance, (V2, V6), (V4, V5), (V3,
V8), (V1, V7) in the simplified case illustrated in Figure 5(b).
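The windowed heuristic can be sketched as a single pass over consecutive pairs of the execution order. The pairing into disjoint windows of two, the two-node minimum for a shared set, and the adjacency lists below are assumptions: the lists are a hypothetical graph chosen to be consistent with the pairs named in the text, not a transcription of Figure 3(a).

```python
def shared_sets_in_window(order, neighbors):
    """For each consecutive pair in the execution order, return the shared
    neighbor set when it contains at least two nodes."""
    found = {}
    for a, b in zip(order[0::2], order[1::2]):
        common = set(neighbors[a]) & set(neighbors[b])
        if len(common) >= 2:          # at least one addition can be saved
            found[(a, b)] = common
    return found

# Hypothetical adjacency lists consistent with the example pairs.
neighbors = {
    2: [4, 5], 6: [4, 5, 8],
    4: [2, 6], 5: [2, 6],
    3: [1, 7], 8: [1, 7, 6],
    1: [3, 8], 7: [3, 8],
}
order = [2, 6, 4, 5, 3, 8, 1, 7]
reuse = shared_sets_in_window(order, neighbors)
```

For this input the pass finds {4, 5} for (V2, V6), {2, 6} for (V4, V5), {1, 7} for (V3, V8), and {3, 8} for (V1, V7). Its cost is linear in the number of nodes times the neighbor-list length, in contrast to a global search over all candidate node sets.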
B. Hardware Accelerator
We tailor the neural network accelerator to fully utilize
the graph-level data locality. Specifically, Rubik accelerator
supports both spatial and temporal data flow for regular (node-
level) and irregular (graph-level) computing, enhanced with
both G-D cache and G-C cache for graph-level data reuse and
computation reuse.
The architectural design of Rubik is demonstrated in
Figure 4. Rubik is mainly comprised of the following basic
components: processing element (PE) array, on-chip memory
hierarchy, and control logic.
1) PE Array
The overall PE array is hierarchically organized: it consists of
multiple PEs, each constituted by a MAC array. The graph-level
computing tasks are dispatched to the PE array and node-level
computing tasks are scheduled inside the MAC array, for
ease of programming and better utilization.
The PEs are connected with 2D-mesh network-on-chip
(NoC) interconnections. The leftmost and rightmost PEs in the
NoC mesh communicate with the memory controllers directly.
The other, central PEs issue their read and write requests
through the 2D-mesh NoC. All the traffic between PEs and
memory goes through the NoC in a first-come-first-served
manner with a one-way routing strategy. There are
two memory controllers in Rubik, at the left and right sides
of the PE array. The access location of a memory request is
determined by the memory address. Once the access location
is determined, the memory request is transferred through the
NoC in either the left-horizontal or the right-horizontal direction.
The detailed design of PE is shown in Figure 4(e), which
consists of the instruction queue, load-store queue (LSQ), NoC
queue, multiply and add accumulator (MAC) array, and two
private caches (G-C and G-D cache) for data reuse of graph-
level non-Euclidean data. Instruction Queue buffers the micro-
instructions including three major categories: load, store, and
computation. The entire GCN training process can be translated
to hardware primitives according to the input graphs.
The detailed programming model and hardware primitives are
in Section IV-C. The micro-instructions can be generated by
the driver and prefetched to the instruction queue with the
streaming strategy for good access efficiency. LSQ buffers the
load and store requests for accessing the feature extraction
data, aggregation data, and update data. G-D and G-C caches
store the graph-level feature data and computation results.
2) On-chip Memory Hierarchy
Hierarchically, the on-chip memory is comprised of global
buffer for PE array, private G-D and G-C cache in every
PE, and register files (RFs) in every MAC. Global buffer
exploits data reuse between PEs, such as the weight matrices.
Except for the weight matrices, all other store requests are
write-through and sent back to the memory controller directly
without on-chip buffering. MAC register files are similar
to that in NN accelerators which exploit all types of data
movement within the computations of one node, including
the convolutional reuse and filter reuse during node-level
computing.
Private G-D and G-C caches exploit graph-level data locality
in a temporal manner, by buffering the feature vectors and
intermediate aggregation results of graph-level non-Euclidean
data inside every PE. Tasks in different PEs do not have non-
Euclidean data reuse nor any data dependency, in order to
improve task-level parallelism without any cache conflict.
It is important to well adapt GNN applications with diverse
graph scales and feature dimension sizes to hardware accel-
erators with careful consideration about task parallelism and
data reuse efficiency. The detailed mapping methodology is
introduced in Section IV-D.
The working flow is as follows. During the calculation of
aggregation operations, the PE first searches the G-D cache
for the feature vectors of neighbors. On a miss, the PE fetches
the feature vector data from off-chip memory and then stores
it in the G-D cache. If the computation reuse optimization
option is enabled, the PE searches the G-C cache for the
intermediate aggregation results, using the node index ids as
the tag. On a hit, the results are obtained directly for the
following computation, which eliminates the redundant
computing. Otherwise, the PE searches the G-D cache again
for the feature vectors of neighbors individually. To ease
implementation and reduce the storage overhead of tag bits, the
reuse of intermediate aggregation results is at the granularity
of two nodes. Both the G-C and G-D caches adopt the LRU (least
recently used) replacement strategy, since the graph ordering
stage already optimizes the reuse distances.
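The lookup order described above can be sketched in software as follows. This is a minimal illustrative model rather than the hardware implementation: the `LRUCache` class, the `load_feature` and `aggregate` helpers, the sum aggregator, the cache capacities, and the choice to fill the G-C cache on a miss are all assumptions made for illustration. The sketch probes the G-C cache first for node pairs flagged as reusable and falls back to per-neighbor G-D lookups.

```python
from collections import OrderedDict


class LRUCache:
    """Small LRU cache standing in for a PE-private G-D or G-C cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def load_feature(node, gd, dram, stats):
    """Probe G-D first; on a miss, fetch from off-chip memory and fill G-D."""
    feat = gd.get(node)
    if feat is None:
        stats["dram_reads"] += 1
        feat = dram[node]
        gd.put(node, feat)
    return feat


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def aggregate(neighbors, reuse_pairs, gd, gc, dram, stats):
    """Sum neighbor features, reusing two-node partial sums from G-C."""
    remaining = list(neighbors)
    acc = [0.0] * len(next(iter(dram.values())))
    for pair in reuse_pairs:
        u, v = pair
        if u not in remaining or v not in remaining:
            continue
        partial = gc.get(pair)
        if partial is None:
            # G-C miss: use per-neighbor G-D lookups, then fill G-C (assumed policy).
            partial = add(load_feature(u, gd, dram, stats),
                          load_feature(v, gd, dram, stats))
            gc.put(pair, partial)
        else:
            stats["gc_hits"] += 1
        remaining.remove(u)
        remaining.remove(v)
        acc = add(acc, partial)
    for n in remaining:
        acc = add(acc, load_feature(n, gd, dram, stats))
    return acc


# Hypothetical feature vectors keyed by node id.
dram = {2: [1.0, 0.0], 4: [1.0, 1.0], 5: [0.0, 2.0], 8: [3.0, 0.0]}
gd, gc = LRUCache(4), LRUCache(2)
stats = {"dram_reads": 0, "gc_hits": 0}
agg_a = aggregate([4, 5], [(4, 5)], gd, gc, dram, stats)
agg_b = aggregate([2, 8, 4, 5], [(4, 5)], gd, gc, dram, stats)
```

In this toy run the second aggregation reuses the partial sum of nodes 4 and 5 from the G-C model, so only the two remaining neighbors are fetched from the modeled off-chip memory.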
C. Programming Model and Hardware Primitives
To generally support diverse GCN algorithms, we adopt
a vertex-centric programming model, since most graph neural
networks are based on this model [9], [11], [17], [35]. Based
on the vertex-centric programming model, we provide the
following hardware primitives to support the execution of
GCNs in Algorithm 1: load-f, load-i, comp, and store. The first
two primitives, load-f and load-i, load and aggregate
the feature vector of a single node and the intermediate
aggregation result of two nodes, respectively. The third
primitive, comp, invokes the computation of the feature
extraction and update functions, which is usually composed of
matrix-vector multiplication and some element-wise
computation instructions. After computing the feature vector
of a node v for the k-th layer ($h_v^{(k)}$), the store primitive
flushes the result of the computation into memory so that it is
visible to other PEs in the iteration of the (k + 1)-th layer.
A vertex-centric programming model avoids the data conflict
issue of edge-centric programming, but is confronted with
synchronization issues during execution. Such overhead is
introduced when the node update operation is blocked while
waiting for neighbors to be aggregated. Thus, we propose a
graph reordering method and an intelligent mapping to alleviate
the effect of irregular memory accesses and the corresponding
synchronization overhead, while retaining the task-level
parallelism. The reordering and mapping stages generate two
inputs to the hardware accelerator. The first input is the task
assignment with the ordered vertex IDs: each PE is assigned a
set of vertices to compute. The second input is the indicator
for the reuse of the intermediate aggregation results, from
which the load-i instructions are generated.
The hardware accelerator executes the hardware primitives
generated by the reordering and mapping stages, which exploit
the locality of feature vectors and the computation reuse of
partial intermediate results.
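As a rough illustration of how the two inputs could be turned into a primitive stream for one vertex, consider the sketch below. The function name `emit_primitives` and the tuple encoding of instructions are hypothetical; the paper only names the four primitives (load-f, load-i, comp, store) and does not specify an instruction format in this section.

```python
def emit_primitives(node, neighbors, reuse_pairs):
    """Return a primitive stream for one vertex as (opcode, operand) tuples.

    reuse_pairs plays the role of the reuse indicator: each flagged pair of
    neighbors is loaded with one load-i instead of two load-f primitives.
    """
    remaining = list(neighbors)
    program = []
    for u, v in reuse_pairs:
        if u in remaining and v in remaining:
            program.append(("load-i", (u, v)))  # intermediate result of two nodes
            remaining.remove(u)
            remaining.remove(v)
    for n in remaining:
        program.append(("load-f", n))  # feature vector of a single node
    program.append(("comp", node))     # feature extraction / update computation
    program.append(("store", node))    # make the result visible to other PEs
    return program


# Hypothetical vertex 6 with neighbors 2, 8, 4, 5 and a reusable pair (4, 5).
program = emit_primitives(6, [2, 8, 4, 5], [(4, 5)])
```

For the hypothetical vertex above, the stream is one load-i for the pair (4, 5), two load-f primitives for nodes 2 and 8, then comp and store.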
D. Mapping Methodology
With the input reordered graph, we map the tasks onto the
Rubik accelerator in a hierarchical manner. Specifically, task
mapping first partitions the input graph, and decides the node
set allocations to every processing element, which is referred
to as graph-level mapping. Then the intra-node computations
are organized into MAC arrays, which is referred to as
node-level mapping.
[Figure 6: (a) PEs, each with a G-D cache, a G-C cache, and a MAC array, process different node windows in parallel along a timeline (task-level parallelism); PE0 holds the feature data of V2, V6, V4, V5 and PE1 holds the feature data of V8, V3, V1, V7. (b) Tiling of weight data and feature data, with tile dimensions labeled 32 and 32.]
Fig. 6. Hierarchical task mapping: (a) Graph-level mapping (Node sets allocation in PEs); (b) Node-level mapping (Intra-node task tiling in MAC array).
1) Graph-level mapping: The mapping strategy of allocating
vertices to PEs considers both data reuse and task-level par-
allelism. After graph reordering, the nodes in the traversal se-
quence have a similar set of neighbors, which enables both the
input data reuse and intermediate computational result reuse.
Hence, we allocate the consecutive nodes in a window of the
reordered traversal sequence to one PE, while each individual
PE computes a different window for task parallelism.
Taking the graph in Figure 5 as an example, the execution
order after reordering is V2, V6, V4, V5, V8, V3, V1, and V7.
With a window size of 4, V2, V6, V4, and V5 are allocated to
PE0, while V8, V3, V1, and V7 are allocated to PE1. This
process is illustrated in Figure 6(a). In PE0, V2, V6, V4,
and V5 are executed sequentially. While computing the
aggregation operations for V2, the feature data of V4 and V5 are
obtained from off-chip memory and buffered in G-D cache.
Since there is an indicator of the shared node set (V4, V5), the
intermediate aggregation result of this pair is stored in the
G-C cache for further reuse. When computing the aggregation
operations for V6, feature data of V2, V8, and intermediate
results of (V4, V5) are needed. V2 and (V4, V5) are hit in G-
D cache and G-C cache respectively, therefore we only need
to get the feature data of V6 and V8. When computing the
aggregation and update operations of V4 and V6, all the feature
data of neighbors are in cache and no off-chip memory traffic
is introduced during computation. This example shows that
the graph-level task mapping based on the reordered graph
improves the temporal reuse locality for the vertices in the
same PE.
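The window-based allocation in this example can be written down as a short sketch. The function name `map_windows` is hypothetical, and wrapping around to the first PE when there are more windows than PEs is an assumption; the text only describes the case where each PE receives one window.

```python
def map_windows(order, window, num_pes):
    """Assign consecutive windows of the reordered sequence to PEs."""
    assignment = {pe: [] for pe in range(num_pes)}
    for start in range(0, len(order), window):
        pe = (start // window) % num_pes  # round-robin wrap-around is assumed
        assignment[pe].extend(order[start:start + window])
    return assignment


# Execution order after reordering, as given in the example above.
order = ["V2", "V6", "V4", "V5", "V8", "V3", "V1", "V7"]
assignment = map_windows(order, window=4, num_pes=2)
```

With a window size of 4 and two PEs, the sketch reproduces the allocation in the example: the first four vertices go to PE0 and the last four to PE1.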
2) Node-level mapping: For the feature vector computation
inside nodes (feature extraction and update), we tile the
matrix-vector multiplication onto the MAC arrays for better data
reuse. Such mapping and tiling techniques have been well
studied in previous work [18], [19]. We adopt a similar
methodology, as shown in Figure 6(b). The matrix-vector
multiplication is partitioned into several blocks according to the
computation capability of the MAC arrays.
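A blocked matrix-vector product of this kind can be sketched as below. The function name `tiled_matvec` and the loop order are illustrative assumptions; the 4x8 tile in the usage lines borrows the Rubik MAC array shape from Table II, although the text does not state that the tile shape equals the MAC array shape.

```python
def tiled_matvec(weight, x, tile_rows, tile_cols):
    """Compute y = weight @ x block by block, accumulating partial sums."""
    rows, cols = len(weight), len(x)
    y = [0.0] * rows
    for r0 in range(0, rows, tile_rows):        # one block of output rows
        for c0 in range(0, cols, tile_cols):    # one block of input columns
            for r in range(r0, min(r0 + tile_rows, rows)):
                partial = 0.0
                for c in range(c0, min(c0 + tile_cols, cols)):
                    partial += weight[r][c] * x[c]
                y[r] += partial                 # accumulate across column blocks
    return y


# Hypothetical 5x10 weight matrix and input vector.
weight = [[r + c for c in range(10)] for r in range(5)]
x = list(range(10))
tiled = tiled_matvec(weight, x, tile_rows=4, tile_cols=8)
direct = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
```

The tiled result matches the direct product; only the order in which partial sums are formed changes, which is what lets a block of weights and inputs stay resident in the MAC array registers while it is reused.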
In summary, such a hierarchical task mapping method de-
couples the irregular graph mapping and regular node mapping
for better data reuse and computing parallelism.
E. Dataflow in Different Computation Stages
In this subsection, we introduce the computing and data
reuse process of the whole forward propagation. The back-
propagation is similar but in a reverse way. As introduced
in Section II, the whole processing pipeline of the forward
propagation is comprised of aggregation reduction and update.
The detailed forward propagation computation and dataflow
are shown in Figure 7.
Overall, the data reuse in Rubik can be generally classified
into two categories: the reuse of graph-level data and node-
level data. For the node-level computation, such as feature
extraction and update stages, the feature map data and weight
matrix are reused in MAC arrays. For the graph-level com-
putation on the node set for aggregation, the feature data is
stored in the private cache of every PE for temporal reuse.
Feature extraction for nodes is initiated at the beginning of
every iteration. During this process, the feature data of nodes
are streamed in from and out to the memory system. Weight
data is stored in the global buffer and reused for the feature
extraction of every node.
Aggregation. After the completion of feature extraction,
Rubik conducts aggregation reduction for every node, by
loading and computing the feature data of its neighbors. The
feature data is buffered in the private (G-D and G-C) cache
of PE. Along with the aggregation for the nodes in the input
graph, there is temporal reuse of feature map data in G-D
cache and intermediate aggregation results in G-C cache. Such
temporal data reuse reduces off-chip memory accesses; Section V-B
analyzes the effect quantitatively.
Update. With the aggregation results of a node as input, the
update operation is carried out by combining the aggregation
results and the previous state of this node. During such a
regular computing process, the weight data and feature data
are reused in MAC arrays and the global buffer. The final
result of the updated feature data will be written through to
off-chip memory directly.
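The three stages can be summarized in a small functional sketch of one forward layer. This is only an illustration under stated assumptions: the sum aggregator, the ReLU nonlinearity, the use of the extracted feature as the node's previous state, and the names `forward_layer`, `w_extract`, and `w_update` are not taken from the paper, whose concrete functions depend on the GCN model.

```python
def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def forward_layer(features, adjacency, w_extract, w_update):
    """One layer: feature extraction, then aggregation, then update."""
    # Stage 1: feature extraction, a per-node matrix-vector product
    # that reuses the same weight matrix for every node.
    extracted = {v: matvec(w_extract, h) for v, h in features.items()}
    out = {}
    for v, nbrs in adjacency.items():
        # Stage 2: aggregation over the neighbors of v (sum is assumed).
        agg = [0.0] * len(w_extract)
        for u in nbrs:
            agg = [a + b for a, b in zip(agg, extracted[u])]
        # Stage 3: update from the aggregation result and the node's own state.
        combined = [a + s for a, s in zip(agg, extracted[v])]
        out[v] = [max(0.0, z) for z in matvec(w_update, combined)]
    return out


# Hypothetical 3-node graph with 2-dimensional features and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [-5.0, 1.0]}
adjacency = {0: [1, 2], 1: [0], 2: [0]}
result = forward_layer(features, adjacency, identity, identity)
```

Stages 1 and 3 are the regular, node-level computations that map onto the MAC arrays, while stage 2 is the irregular, graph-level step whose neighbor accesses the private caches are meant to serve.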
V. EXPERIMENTAL RESULTS
In this section, we first introduce the experimental setup
and analyze the performance impact of graph reordering and
mapping methodologies. Then we compare the performance
and energy of Rubik to NN accelerator, GPU, and CPU.
Finally, we analyze the impact of embedding size and graph
degree on performance and show that Rubik adapts well to
diverse applications on the hardware platform.
[Figure 7: dataflow of the feature extraction, aggregation, and update stages for inference and training across the memory hierarchy (memory, global buffer, private cache in each processing element, and RF/ALU pairs in the MAC array). Legend: weight data from memory; feature data from/to memory; weight data from global buffer; feature data from private cache.]
Fig. 7. Dataflow in Rubik: an example of weight data, feature data, and intermediate aggregation results reuse across four layers of memory hierarchy.
A. Experimental Setup
GCN Datasets. Our graph accelerator evaluation covers a
wide spectrum of mainstream graph datasets, including
benchmark datasets for graph kernels [20] and datasets commonly
used by previous studies [9], [35] in related domains. Details
of these datasets are listed in Table I. We also build a synthetic
benchmark based on Citeseer [40], named Citeseer-S, which has
227,320 vertices with a feature dimension of 3,703. This relatively
large graph with a high dimension is built to test the hardware
capability.
TABLE I
GRAPH DATASETS
Dataset       #G      Avg.#V    Avg.#E        D      #Class
COLLAB        5,000   74.49     2,457.78      492    3
BZR           405     35.75     38.36         53     2
IMDB-BINARY   1,000   19.77     96.53         136    2
DD            1,178   284.32    715.66        89     2

Dataset       #G      #V        #E            D      #Class
CITESEER-S    1       227,320   814,134       3,703  41
REDDIT        1       232,965   114,615,892   602    6
GCN Models. In this work, we mainly test two commonly
used graph convolutional neural network models: GIN [32]
and GraphSage [9]. We use the default configuration in a
broadly used GCN library (PyTorch Geometric (PyG) [41]),
where GraphSage has 2 SageConv layers with hidden
dimension = 256, and GIN has 5 SageConv layers and 2 linear
layers with hidden dimension = 128.
Hardware Configurations.
1) Rubik: We implement a cycle-accurate simulator to
evaluate the total execution latency (cycles), with the
accelerator conservatively clocked at 500 MHz, as discussed
in Section V-D. This simulator models the modules in the
architecture design, including the PEs, NoC, on-chip buffer,
and private caches, as introduced in Section IV. The
configuration of Rubik is shown in Table II.
2) GPU: In addition to accelerators, we also evaluate the
GCN performance on an NVIDIA Quadro P6000 GPU (3840
CUDA cores, 12 TFLOPS peak performance, 24GB GDDR5X
memory, 432GB/s peak bandwidth). The GCN implementations
are based on PyG [41]. The GPU performance is
estimated by NVProf [42], which excludes the memory copy
time and system stack overhead.
TABLE II
HARDWARE PLATFORM CONFIGURATIONS
                     NN-Acc       Graph-Acc   Rubik       GPU
Comp  PE Array       8x8 PEs      8x8         8x8         3840 Cores
      MAC Array      16x16 MACs   1x4         4x8
Mem   Mem BW         32GB/s (all three accelerators)      432GB/s
      Global buffer  2 MB         4 MB        2 MB        L2: 3MB
      Private Cache  –            256KB/PE    128KB/PE    L1: 48KB/SM
      RegisterFile   16KB/PE      256B/PE     2KB/PE      RF: 48K/SM
B. Scheduling Optimization
Rubik incorporates both the hardware accelerator design
and mapping methodology based on the reordered graph. In
this section, we first analyze the impact of graph reordering
which aims to improve the data reuse of non-Euclidean data.
Specifically, we compare the following three strategies on
Rubik platform: 1) Index-order: compute with the index order
of nodes; 2) LSH-Reordering (LR): compute the nodes in
the reordered order after LSH-based graph reordering; 3)
Reordering&Computation results Reuse (LR&CR): reuse the
intermediate aggregation computation results in the G-C cache,
with the reordered input graphs.
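Why the execution order matters for a small private cache can be seen in a toy model. The sketch below counts modeled off-chip reads for a given node order with an LRU cache of neighbor features; the function name `count_offchip_reads`, the tiny graph, and the cache capacity are hypothetical and chosen only to make the effect visible, so the counts say nothing about the measured results in this section.

```python
from collections import OrderedDict


def count_offchip_reads(order, adjacency, capacity):
    """Count neighbor-feature misses of an LRU cache for a given node order."""
    cache = OrderedDict()
    misses = 0
    for node in order:
        for nbr in adjacency[node]:
            if nbr in cache:
                cache.move_to_end(nbr)         # hit: refresh recency
            else:
                misses += 1                    # miss: modeled off-chip read
                cache[nbr] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses


# Hypothetical graph: nodes 0 and 2 share neighbors, as do nodes 1 and 3.
adjacency = {0: [10, 11], 1: [20, 21], 2: [10, 11], 3: [20, 21]}
index_order_reads = count_offchip_reads([0, 1, 2, 3], adjacency, capacity=2)
reordered_reads = count_offchip_reads([0, 2, 1, 3], adjacency, capacity=2)
```

In this toy case, placing nodes with the same neighbors next to each other halves the modeled off-chip reads, which is the kind of temporal reuse that the reordered strategies aim for.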
Performance Comparison. We compare the speedup of the
latter two strategies over the first one, as shown in Figure 9(a)
and (b). We make the following observations: 1) Reordered
graph generally improves the performance with the speedup
of about 3.14x and 2.59x for GraphSage and GIN, across the
datasets with different degree distributions and feature dimen-
sion sizes. 2) For input graphs with larger degrees, reusing the
graph-level intermediate computation results (LR&CR) brings
significant speedup. As shown in Figure 9, COLLAB has an
average degree of 32 and it achieves 15.5x speedup by reusing
the aggregation results during GIN training.
Off-chip Memory Traffic Reduction. We further analyze the
off-chip memory access reduction achieved by the dataflow
optimization. The off-chip memory access volume of these three
strategies is shown in Figure 9(c) and (d). Generally, compared
to index-order execution, LR graph reordering reduces off-chip
memory traffic by 69% and 58% for GraphSage and GIN,
respectively. For the large sparse graphs with a large average
degree, such as COLLAB and Reddit, the intermediate aggregation
reuse (LR&CR) eliminates more than 90% of memory accesses in
a further step. These results consistently show that optimization
for non-Euclidean data significantly reduces the memory
traffic and improves the memory access efficiency.
[Figure 8: bar charts comparing NN-Acc, Rubik, and GPU; panels (a) GIN and (b) GraphSage show speedup (x), and panels (c) GIN and (d) GraphSage show normalized energy.]
Fig. 8. Speedup and energy comparison for NN-like accelerator, Rubik, and GPU.
[Figure 9: bar charts comparing Index-Order, LR, and LR&CR; panels (a) and (b) show speedup (x), and panels (c) and (d) show the DRAM traffic ratio.]
Fig. 9. Speedup and off-chip memory traffic reduction under different graph scheduling&mapping strategies.
C. Speedup
We compare the performance and energy efficiency of the
NN accelerator (baseline), Rubik, CPU, and GPU, with the
detailed configurations shown in Table II. For a fair
comparison, all these architectures take the same reordered
graphs as input.
Performance. We evaluate the execution latency of training
on the entire graph for one epoch and compare it with the
baselines, as shown in Figure 8(a) and (b). Overall, Rubik shows
speedups of 1.35x to 14.16x over the NN accelerator
when running the GIN model. Meanwhile, Rubik achieves 1.30x
to 12.05x speedup when running GraphSage.
We further compare Rubik with the GPU platform and
provide the following observations.
1) Larger graphs with high dimension sizes and node
volumes are more performance-sensitive to the data reuse
optimizations. When training GraphSage models, Rubik
achieves 9.18x and 10.76x speedup on Reddit and
Citeseer-S, which have a large graph scale. In contrast, the GPU
outperforms Rubik when training on small graphs, such as
COLLAB, BZR, IMDB, and DD. The key reason is that their
memory footprint is so small that most feature data and weight
data can be held in the on-chip memory hierarchy, so training
GCNs becomes compute-bound. For larger graphs, the feature data
of nodes cannot be held in the on-chip hierarchy. Additionally,
in GCNs, most of the operations are based on matrix-vector
multiplication, which has a much larger memory-to-compute ratio
than matrix-matrix multiplication. Thus, the data reuse
optimization plays a more important role for larger graphs.
Consistently, Rubik achieves a larger speedup over the
GPU on Reddit and Citeseer-S.
2) Deeper GCN models are more performance-sensitive to
the data reuse optimizations. For the GIN model, which is
deeper (5 SageConv layers and 2 linear layers) than
GraphSage (2 SageConv layers), Rubik achieves speedups
of 3.42x to 4.52x over the GPU platform even on
small graphs (COLLAB, BZR, IMDB, and DD). Overall,
Rubik achieves speedups of 3.42x to 46.7x
across the various datasets when training GIN models.
D. Hardware Overhead
We compare the performance and energy efficiency of the NN
accelerator, Rubik, and the GPU, with the detailed
configurations shown in Table II. For the power and area
evaluation of the NN and Rubik accelerators, we break down the
circuit model estimation into the compute logic, memory arrays,
and hierarchical wires. We adopt Design Compiler under 45nm
technology for RTL synthesis of the MAC array and control logic,
Micron Power Calculators for SRAM and DRAM estimation, and
McPAT [43] for the NoC interconnection area and power
estimation. We conservatively run the accelerator at 500 MHz,
which comfortably satisfies the timing constraints. GPU power
is sampled by nvidia-smi, a tool provided with the
NVIDIA CUDA driver.
Energy Consumption. In addition to the performance
comparison, we compare the energy consumption of Rubik, the NN
accelerator, and the GPU. Energy consumption is calculated by
multiplying the average power by the execution time. Compared to
the GPU, Rubik improves energy efficiency by 26.3x to 1375.2x
across different datasets and GCN models. Compared to NN-like
accelerators, Rubik improves energy efficiency by 1.47x
to 7.92x for GIN and 1.13x to 8.20x for GraphSage. Compared to
graph-like accelerators, Rubik improves energy efficiency by
1.60x to 1.87x for GIN and 1.69x to 2.52x for GraphSage. The
energy gap to the graph-like accelerator is smaller than the
gap to the NN-like accelerator because a large proportion of
the energy is consumed by the on-chip cache and DRAM memory
accesses.
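The energy metric described above amounts to a product and a ratio, sketched below. The function names and all numeric values are hypothetical placeholders chosen for easy arithmetic; they are not measurements from this evaluation.

```python
def energy_joules(avg_power_watts, exec_time_seconds):
    """Energy as average power multiplied by execution time."""
    return avg_power_watts * exec_time_seconds


def efficiency_gain(baseline_energy, candidate_energy):
    """How many times less energy the candidate uses than the baseline."""
    return baseline_energy / candidate_energy


# Hypothetical platforms: a 200 W baseline running for 10 s versus
# a 4 W accelerator running for 5 s on the same workload.
baseline = energy_joules(200.0, 10.0)
accelerator = energy_joules(4.0, 5.0)
gain = efficiency_gain(baseline, accelerator)
```

Because the metric multiplies power and time, an accelerator can be more energy-efficient than a faster platform as long as its power draw is proportionally lower.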
Area. We further evaluate the area overhead of Rubik, which
mainly consists of the following components: computation
logic, on-chip buffers and queues, hierarchical interconnection,
and control logic. The computation units comprise the
MAC arrays. The on-chip buffers and queues include the LSQ,
instruction queue, G-D cache, G-C cache, global buffer, and
register files, as described in Table II. In summary, under a
45nm technology process, Rubik has an area of 36.86 mm².
VI. DISCUSSION
Graph-Reordering Overhead. In this work, graph
reordering happens only once, in the preprocessing stage.
It is based on row and column transformation according
to the LSH clustering results. LSH clustering is lightweight
and friendly to hardware parallelization. For the Reddit
dataset with 232,965 nodes, the graph reordering requires only
several seconds to complete. We compare the performance
of the GPU and Rubik with and without the preprocessing overhead
under a training scenario with 100 epochs, as shown in
Figure 10. Without the preprocessing overhead, Rubik achieves
46.7x and 9.06x speedup on Citeseer-S and Reddit. With the
preprocessing overhead, Rubik still achieves 37.4x and 8.66x
speedup compared to the GPU.
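How a one-time preprocessing cost is amortized over a multi-epoch training run can be expressed with a small helper. The function name `speedup_with_preprocessing` and the timing values are hypothetical and serve only to show the calculation; they are not the measured times behind Figure 10.

```python
def speedup_with_preprocessing(gpu_epoch_s, acc_epoch_s, preprocess_s, epochs):
    """Speedup over the GPU when a one-time preprocessing cost is included."""
    gpu_total = gpu_epoch_s * epochs
    acc_total = acc_epoch_s * epochs + preprocess_s  # reordering is paid once
    return gpu_total / acc_total


# Hypothetical timings: 20 s per epoch on the GPU, 2 s per epoch on the
# accelerator, and 50 s of one-time graph reordering.
without_overhead = speedup_with_preprocessing(20.0, 2.0, 0.0, 100)
with_overhead = speedup_with_preprocessing(20.0, 2.0, 50.0, 100)
```

Since the reordering cost is paid once, its share of the total shrinks as the number of epochs grows, so the speedup with overhead approaches the speedup without it.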
In addition, such an LSH-based technique can be extended
to support online graph reordering for batching and sampling
techniques. The LSH clustering has a time complexity of
O(n ∗ nz ∗ |H|), where |H| is the number of hashing
functions and nz is the average number of non-zero elements in
the adjacency matrix. Supporting online graph reordering is
left as future work.
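To make the stated complexity concrete, the sketch below reorders nodes by a MinHash signature of their neighbor sets. MinHash is swapped in here as one common LSH family for set similarity; this section does not specify which hashing scheme Rubik uses, and the names `minhash_signature` and `lsh_reorder`, the hash parameters, and the bucket ordering are assumptions. Each node evaluates every hash function on every neighbor, which gives the O(n ∗ nz ∗ |H|) cost.

```python
import random


def minhash_signature(neighbors, hash_params, prime):
    """One minimum per hash function over the node's neighbor ids."""
    return tuple(min((a * n + b) % prime for n in neighbors)
                 for a, b in hash_params)


def lsh_reorder(adjacency, num_hashes=4, seed=0):
    """Group nodes whose neighbor sets hash to the same signature."""
    rng = random.Random(seed)
    prime = 2_147_483_647  # modulus for the linear hash functions
    hash_params = [(rng.randrange(1, prime), rng.randrange(0, prime))
                   for _ in range(num_hashes)]
    buckets = {}
    for node, nbrs in adjacency.items():
        sig = minhash_signature(nbrs, hash_params, prime) if nbrs else ()
        buckets.setdefault(sig, []).append(node)
    order = []
    for sig in sorted(buckets):  # any deterministic bucket order would do
        order.extend(buckets[sig])
    return order


# Hypothetical graph: nodes 1 and 3 share neighbors, as do nodes 2 and 6.
adjacency = {1: [4, 5], 2: [7, 8], 3: [4, 5], 6: [7, 8]}
order = lsh_reorder(adjacency)
```

Nodes with identical neighbor sets always receive identical signatures and therefore end up adjacent in the new order; nodes with similar but not identical sets collide only with some probability, which a practical scheme would tune through the number of hash functions.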
[Figure 10: bar chart of training time in seconds (Time(s), axis from 0 to 3000) on CITESEER-S and REDDIT for GPU, Rubik, and Rubik+preprocessing; the GPU bars are labeled 2264.9 and 1962.9.]
Fig. 10. Preprocessing overhead.
Batching and Sampling Influence. Batching and sampling
strategies are proposed for training graph models to alleviate
the memory and computation burden of training on the entire
graph in one epoch and to improve the convergence speed
as well [9], [35]. State-of-the-art algorithmic work [35] observes
that training node sets with more edges are very important
for improving the convergence rate of GCN models during
sampling or batching. Our reordering methodology helps to
identify node sets with dense connections,
thus enabling more efficient batching and sampling.
Additionally, the reordered graph remains useful even for
random batching or sampling because the order for temporal
data reuse is preserved in the subgraphs.
VII. RELATED WORK
Graph acceleration. Observing that graph applications
exhibit high cache miss rates and under-utilization
of memory bandwidth, many works have been proposed to
accelerate graph analytics applications. They can be classified
into the following categories: 1) Graph Preprocessing: To
improve the data access efficiency, it is necessary
to preprocess graph data so that the graph structure is adapted
to the hardware accelerators. Examples include graph layout
reorganization, graph ordering [44], and graph partitioning [45].
Our work incorporates graph ordering techniques to improve
the data reuse of non-Euclidean dataflow during GCN training.
2) Hardware acceleration: Customized architectures have
been proposed to accelerate graph applications. Previous work
designs hardware modules to implement the gather, apply, and
scatter phases in graph computing [21], [46]. Graphicionado [21]
adopts large on-chip eDRAM for storage of the graph data to
eliminate random accesses, and another work [46] designs
a dedicated cache hierarchy for different graph data. However,
such on-chip designs cannot efficiently handle the spatial
data reuse inside the NN-based computation. Additionally, the
computing units in graph accelerators are too lightweight for
the NN-based computation of GCN applications.
DNN Accelerators. Academia and industry have proposed
various architectures for the general acceleration of
DNNs [18], [19], [47], [48], which can be classified as
temporal architectures and spatial architectures. The spatial
accelerators are based on dataflow processing, where the
processing elements or ALUs form a processing datapath
and communicate with each other directly. Many advanced
dataflow optimization strategies have been proposed, such as
input stationary, weight stationary, and row stationary. Such
dataflow designs eliminate the overhead of loading or storing
data from and into the memory hierarchy. However, the dataflow
optimizations are only applicable to Euclidean data processing
with regular data reuse directions or datapaths. For
irregular graph data, there is no uniform data reuse datapath.
Therefore, our work proposes a memory hierarchy design to
support both of these two dataflows to improve data access
efficiency.
GNN Accelerators. Observing the challenges of GNN
computing, some pioneering works have been proposed to
accelerate GCN inference. Yan et al. [49] and Auten
et al. [50] propose accelerator designs for GNNs
with pure hardware design. Yan’s work proposes the hardware
methodologies of window sliding and window shrinking to
improve memory efficiency. However, as we demonstrated,
processing index-order input graphs ignores the global-level
data locality. Our work decouples the hierarchical paradigms
and leverages two schemes of graph-level data locality,
feature data reuse and intermediate aggregation result reuse,
achieving better performance.
VIII. CONCLUSION
The graph convolutional network (GCN) is a promising
approach to learn the inductive representation of graphs from
many application domains. To meet the demands of this new
learning method, which mixes the computation of graph analytics
and neural networks, we propose Rubik, a geometric learning
accelerator based on spatial architectures for graph neural
network models, and enhance the memory hierarchy design to
support the data reuse of both Euclidean and non-Euclidean
data. We further develop a lightweight graph reordering
strategy to improve the temporal reuse of non-Euclidean data and
eliminate redundant computation. Finally, we evaluate the Rubik
accelerator design and compare it with the existing architectural
designs of the NN accelerator and graph accelerator on
representative GCN models and datasets. Evaluation results
demonstrate that Rubik together with our mapping method achieves
significant speedup and better energy efficiency compared with
prior designs.
REFERENCES
[1] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and
J. Leskovec, “Graph convolutional neural networks for web-scale
recommender systems,” in Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining,
KDD ’18, pp. 974–983, ACM. event-place: London, United Kingdom.
[2] R. v. d. Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix
completion,” arXiv preprint arXiv:1706.02263, 2017.
[3] F. Monti, M. M. Bronstein, and X. Bresson, “Geometric matrix com-
pletion with recurrent multi-graph neural networks,” in Proceedings of
the 31st International Conference on Neural Information Processing
Systems, NIPS’17, (USA), pp. 3700–3710, Curran Associates Inc., 2017.
[4] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for
scene graph generation,” in Proceedings of the European Conference
on Computer Vision (ECCV), pp. 670–685, 2018.
[5] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Fac-
torizable net: an efficient subgraph-based framework for scene graph
generation,” in Proceedings of the European Conference on Computer
Vision (ECCV), pp. 335–351, 2018.
[6] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel,
A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs
for learning molecular fingerprints,” in Advances in neural information
processing systems, pp. 2224–2232, 2015.
[7] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,”
[8] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun,
“Graph neural networks: A review of methods and applications,”
[9] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation
learning on large graphs,” in Advances in Neural Information Processing
Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fer-
gus, S. Vishwanathan, and R. Garnett, eds.), pp. 1024–1034, Curran
Associates, Inc.
[10] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks,”
[11] J. Chen, T. Ma, and C. Xiao, “FastGCN: Fast learning with graph
convolutional networks via importance sampling,”
[12] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec,
“Hierarchical graph representation learning with differentiable pooling,”
in Advances in Neural Information Processing Systems 31 (S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Gar-
nett, eds.), pp. 4800–4810, Curran Associates, Inc., 2018.
[13] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou,
“AliGraph: A comprehensive graph neural network platform,”
[14] T. D. Bui, S. Ravi, and V. Ramavajjala, “Neural graph learning: Training
neural networks using graphs,” in Proceedings of the Eleventh ACM
International Conference on Web Search and Data Mining, WSDM ’18,
(New York, NY, USA), pp. 64–71, ACM, 2018.
[15] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning
of social representations,” in Proceedings of the 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
KDD ’14, (New York, NY, USA), pp. 701–710, ACM, 2014.
[16] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and
performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–
94, 2018.
[17] W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling towards
fast graph representation learning,” in Advances in Neural Information
Processing Systems 31 (S. Bengio, H. Wallach, H. Larochelle, K. Grau-
man, N. Cesa-Bianchi, and R. Garnett, eds.), pp. 4558–4567, Curran
Associates, Inc.
[18] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–
138, 2016.
[19] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“Diannao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in ACM Sigplan Notices, vol. 49, pp. 269–284,
ACM, 2014.
[20] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann,
“Benchmark data sets for graph kernels,” 2016.
[21] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphi-
cionado: A high-performance and energy-efficient accelerator for graph
analytics,” in 2016 49th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), pp. 1–13, IEEE, 2016.
[22] R. Kaspar and B. Horst, Graph classification and clustering based on
vector space embedding, vol. 77. World Scientific, 2010.
[23] J. Gibert, E. Valveny, and H. Bunke, “Graph embedding in vector
spaces by node attribute statistics,” Pattern Recognition, vol. 45, no. 9,
pp. 3072–3083, 2012.
[24] A. G. Duran and M. Niepert, “Learning graph representations with
embedding propagation,” in Advances in neural information processing
systems (NIPS), pp. 5119–5130, 2017.
[25] H. Chen, X. Li, and Z. Huang, “Link prediction approach to collaborative
filtering,” in Proceedings of the 5th ACM/IEEE-CS Joint Conference on
Digital Libraries (JCDL), pp. 141–142, IEEE, 2005.
[26] J. Kunegis and A. Lommatzsch, “Learning spectral graph transforma-
tions for link prediction,” in Proceedings of the 26th Annual International
Conference on Machine Learning (ICML), pp. 561–568, 2009.
[27] T. Tylenda, R. Angelova, and S. Bedathur, “Towards time-aware link
prediction in evolving social networks,” in Proceedings of the 3rd
workshop on social network mining and analysis, pp. 1–10, 2009.
[28] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on
graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[29] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural
networks on graphs with fast localized spectral filtering,” in Proceedings
of the 30th International Conference on Neural Information Processing
Systems, NIPS’16, (USA), pp. 3844–3852, Curran Associates Inc., 2016.
[30] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets:
Graph convolutional neural networks with complex rational spectral
filters,” IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–
109, 2018.
[31] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural
networks for graphs,” in International conference on machine learning,
pp. 2014–2023, 2016.
[32] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph
neural networks?,”
[33] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, “Gaan: Gated
attention networks for learning on large and spatiotemporal graphs,”
arXiv preprint arXiv:1803.07294, 2018.
[34] V. Zayats and M. Ostendorf, “Conversation modeling on reddit us-
ing a graph-structured lstm,” Transactions of the Association for
Computational Linguistics, vol. 6, pp. 121–132, 2018.
[35] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh,
“Cluster-GCN: An efficient algorithm for training deep and large graph
convolutional networks,” in Proceedings of the 25th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining,
KDD ’19, (New York, NY, USA), pp. 257–266, ACM, 2019.
[36] M. Girvan and M. E. Newman, “Community structure in social and
biological networks,” Proceedings of the National Academy of Sciences,
vol. 99, no. 12, pp. 7821–7826, 2002.
[37] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approx-
imate nearest neighbor in high dimensions,” Commun. ACM, vol. 51,
pp. 117–122, Jan. 2008.
[38] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt,
“Practical and optimal LSH for angular distance,” in Proceedings of
the 28th International Conference on Neural Information Processing
Systems - Volume 1, NIPS’15, (Cambridge, MA, USA), pp. 1225–1233,
MIT Press, 2015.
[39] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive
hashing scheme based on p-stable distributions,” in Proceedings of the
Twentieth Annual Symposium on Computational Geometry, SCG ’04,
(New York, NY, USA), pp. 253–262, ACM, 2004.
[40] R. A. Rossi and N. K. Ahmed, “The network data repository with
interactive graph analytics and visualization,” in AAAI, 2015.
[41] M. Fey and J. E. Lenssen, “Fast graph representation learning with
PyTorch geometric,”
[42] NVIDIA, “CUDA toolkit documentation.”
[43] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen,
and N. P. Jouppi, “McPAT: An integrated power, area, and timing
modeling framework for multicore and manycore architectures,” in 2009
42nd Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), pp. 469–480, Dec 2009.
[44] V. Balaji and B. Lucia, “When is graph reordering an optimization?
studying the effect of lightweight graph reordering across applications
and input graphs,” in 2018 IEEE International Symposium on Workload
Characterization (IISWC), pp. 203–214, Sep. 2018.
[45] D. Chakrabarti, “Autopart: Parameter-free graph partitioning and outlier
detection,” in European Conference on Principles of Data Mining and
Knowledge Discovery, pp. 112–124, Springer, 2004.
[46] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and
O. Ozturk, “Energy efficient architecture for graph analytics acceler-
ators,” in 2016 ACM/IEEE 43rd Annual International Symposium on
Computer Architecture (ISCA), pp. 166–177, June 2016.
[47] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and
E. S. Chung, “Accelerating deep convolutional neural networks using
specialized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11,
pp. 1–4, 2015.
[48] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee,
J. M. Hernández-Lobato, G. Wei, and D. Brooks, “Minerva: Enabling
low-power, highly-accurate deep neural network accelerators,” in
2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), pp. 267–278, June 2016.
[49] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan,
and Y. Xie, “HyGCN: A GCN accelerator with hybrid architecture,” in
2020 IEEE International Symposium on High Performance Computer
Architecture (HPCA), pp. 15–29, 2020.
[50] A. Auten, M. Tomei, and R. Kumar, “Hardware acceleration of graph
neural networks,” in 2020 57th ACM/IEEE Design Automation
Conference (DAC), pp. 1–6, 2020.
