Deep Graph Library Optimizations for Intel(R) x86 Architecture by Avancha, Sasikanth et al.
DEEP GRAPH LIBRARY OPTIMIZATIONS FOR INTEL® X86 ARCHITECTURE
A PREPRINT
Sasikanth Avancha
Parallel Computing Lab, Intel Labs
Intel Corporation
Bangalore, India
sasikanth.avancha@intel.com
Vasimuddin Md.
Parallel Computing Lab, Intel Labs
Intel Corporation
Bangalore, India
vasimuddin.md@intel.com
Sanchit Misra
Parallel Computing Lab, Intel Labs
Intel Corporation
Bangalore, India
sanchit.misra@intel.com
Ramanarayan Mohanty
Parallel Computing Lab, Intel Labs
Intel Corporation
Bangalore, India
ramanarayan.mohanty@intel.com
July 14, 2020
ABSTRACT
The Deep Graph Library (DGL) was designed as a tool to enable structure learning from graphs, by
supporting a core abstraction for graphs, including the popular Graph Neural Networks (GNN). DGL
contains implementations of all core graph operations for both the CPU and GPU. In this paper, we
focus specifically on CPU implementations and present performance analysis, optimizations and
results across a set of GNN applications using the latest version of DGL (0.4.3). Across 7 applications,
we achieve speed-ups ranging from 1.5×-13× over the baseline CPU implementations.
1 Introduction
Graph Neural Networks (GNN) [1, 2, 3, 4, 5] are a very important class of Neural Network algorithms for learning
the structure of large, population-scale graphs. Often, GNNs are combined with traditional graph structure discovery
algorithms via traversal (e.g., Breadth-First Search (BFS), Depth-First Search (DFS), RandomWalk) to achieve higher
accuracy in learning their structure. Given a graph G = (V, E), the neural network formulation in GNNs implies that,
unlike graph traversal algorithms, they attempt to learn structure of G via low-dimensional representations associated
with V or E or both. GNN algorithms, broadly, learn these representations in two parts: in feature vectors Fv and/or Fe
associated with V and E , respectively and a set of graph-wide, shared parameters W . Via a recursive process called
Aggregation, GNNs encode multi-hop neighborhood representations in Fv and/or Fe. Depending upon the specific
algorithm and the task (e.g., node classification or link prediction), feature-vector aggregation precedes or follows a
shallow neural network, typically consisting of one or more linear transforms followed by a classification or regression
model; some algorithms additionally employ a self-attention mechanism.
Given that aggregation is the core operation in all GNN training and inference algorithms, our focus in this paper is on
accelerating aggregation performance on Intel® Xeon® high-performance CPUs. Let t = (u, v, e) be a tuple, where
e is the edge, and u ∈ U and v ∈ V are the source and destination vertices, respectively. Inherently, the aggregation
operation involves message passing between any two entities in t. DGL implements the aggregation operation via two
basic primitives: send(x) and recv(y,⊕), where x, y ∈ t and ⊕ is a reduction operation. We observed that DGL fuses
send and recv into a single primitive fused_sr(x, y,⊕) when aggregation consists of a simple arithmetic operation (as
described in [6]); DGL implements a built-in fused aggregation primitive in such cases.
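To make the difference between the two-step and the fused form concrete, the following minimal Python sketch contrasts them on a toy edge list. The names send, recv and fused_sr follow the text, but the signatures, the edge-list representation and the zero initial value are assumptions made for illustration; this is not DGL's API.

```python
import operator

def send(edges, src_feat):
    """Materialize one message per edge (src, dst): a copy of the source feature."""
    return [(dst, list(src_feat[src])) for src, dst in edges]

def recv(messages, num_dst, dim, reduce_op=operator.add, init=0.0):
    """Reduce every message addressed to a destination with reduce_op."""
    out = [[init] * dim for _ in range(num_dst)]
    for dst, msg in messages:
        out[dst] = [reduce_op(a, b) for a, b in zip(out[dst], msg)]
    return out

def fused_sr(edges, src_feat, num_dst, dim, reduce_op=operator.add, init=0.0):
    """Fused form: reduce straight from the source feature, with no message buffer."""
    out = [[init] * dim for _ in range(num_dst)]
    for src, dst in edges:
        out[dst] = [reduce_op(a, b) for a, b in zip(out[dst], src_feat[src])]
    return out

# Hypothetical toy graph: three nodes, edges given as (source, destination).
edges = [(0, 2), (1, 2), (1, 0)]
feat = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
two_step = recv(send(edges, feat), num_dst=3, dim=2)
fused = fused_sr(edges, feat, num_dst=3, dim=2)
```

Both paths produce the same destination features; the fused form simply avoids allocating one message per edge.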
arXiv:2007.06354v1 [cs.DC] 13 Jul 2020
A PREPRINT - JULY 14, 2020
DGL implements unary aggregation primitives (e.g. copy_u and copy_e) which reduce a set of source features into
the destination feature. In DGL parlance, the unary aggregation primitive is called Copy-Reduce. Similarly, DGL
implements binary aggregation primitives (e.g. u_mul_e_add_v), which first perform an element-wise binary operation
on two input feature vectors and then reduce the result into the destination feature via message passing. In DGL parlance,
the binary aggregation primitive is called Binary-Reduce. We discuss these primitives in greater detail in the next
section.
As Wang et al. describe in [6], the DGLGraph interface hides the details of the graph data structure (e.g., CSR or COO)
from the programmer to enable better productivity. However, this implies that the application performance using DGL
for GNN training and inference depends on how well the graph data structure and its associated operations have been
optimized for the underlying architecture. Our analysis of various applications using DGL revealed that the aggregation
operation is implemented using sub-optimal primitives (such as serialization, explicit buffer copies prior to reduction),
resulting in low performance on the Intel® Xeon® processor family. In this paper, we optimize the aggregation
primitives in DGL on the CPU and demonstrate per-epoch speedups as high as 13× on DGL GNN applications.
The rest of the paper is organized as follows. Section 2 describes the aggregate primitives used by DGL. Section 3
describes the optimizations applied to binary-reduce and copy-reduce primitives. Section 4 discusses primitives
implemented in the PyTorch framework that impact GNN performance, and their optimizations. In Section 5, we
discuss the results of our optimizations and show performance improvements across various GNN applications on
Intel® Xeon® processors. Section 6 concludes the paper.
2 Aggregation Primitives
Our understanding and analysis of the aggregation primitives in DGL lead to the conclusion that these operations can be
represented as a sequence of linear algebraic expressions involving node and edge features, along with the appropriate
operators. The expression is sufficient to describe the aggregation over the complete graph or sub-graph upon which the
operation executes.
2.1 Binary-Reduce (BR)
As discussed in section 1, BR consists of a sequence of two operations – an element-wise binary operation between
a pair of feature vectors, and an element-wise operation that reduces the intermediate feature vector into the output
feature vector. When applied over the whole graph, the operands may be considered as multi-dimensional tensors
representing nodes and edges.
Equation 1 shows a mathematical representation of BR, with operators ⊗ (element-wise binary operator) and ⊕
(element-wise reduction operator), and feature vector operands x, y and z. x and y are inputs to ⊗, ⊗(x, y) and z are
inputs to ⊕; the final result is in z.
$$
\mathrm{BR}(x, y, \otimes, \oplus, z) : \oplus(\otimes(x, y), z), \quad \forall x, y, z \in G(V, E) \tag{1}
$$
Figure 1 shows an example subgraph induced on a (possibly) larger directed graph, and rooted at node v00. In the
example, all nodes labeled u1x are 1-hop neighbors of v00; similarly nodes u2x are its 2-hop neighbors. Now, for
example, u_mul_e_add_v BR between nodes u20 and u12, with feature vectors Fu20 and Fu12, respectively and
edge-feature vector Fe2012 would have the following expression:
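$$
\begin{aligned}
\mathrm{BR}_{21}(F_{u_{20}}, F_{e_{2012}}, \times, +, u_{12}) :\quad & t \leftarrow F_{u_{20}} \times F_{e_{2012}} \\
& u_{12} \leftarrow u_{12} + t
\end{aligned}
\tag{2}
$$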
Further, we observe that nodes u11 and u13 will not be part of BR on the subgraph rooted at v00 because there are no
edges from them to v00.
If the size and shape of the input feature vectors are not equal, and if one of them has size 1, then BR broadcasts the
smaller feature vector to the dimension of the larger one; in all other cases, BR will fail to execute.
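As a minimal sketch of Equation 1 and the broadcast rule for a single (x, y, z) triple, the Python below uses plain lists as feature vectors. The function names and the use of ValueError for the failing case are assumptions made for illustration, not DGL's implementation.

```python
import operator

def broadcast(x, y):
    """Size rule from the text: equal sizes, or one operand of size 1."""
    if len(x) == len(y):
        return x, y
    if len(x) == 1:
        return x * len(y), y
    if len(y) == 1:
        return x, y * len(x)
    raise ValueError("feature sizes %d and %d cannot be broadcast" % (len(x), len(y)))

def binary_reduce(x, y, z, binary_op, reduce_op):
    """Element-wise reduce_op(binary_op(x, y), z), following Equation 1."""
    x, y = broadcast(x, y)
    t = [binary_op(a, b) for a, b in zip(x, y)]       # intermediate feature t
    return [reduce_op(ti, zi) for ti, zi in zip(t, z)]

# u_mul_e_add_v on one edge: a 3-wide node feature times a scalar edge weight.
f_u20 = [1.0, 2.0, 3.0]
f_e2012 = [0.5]                       # size 1, so it is broadcast
f_u12 = [10.0, 10.0, 10.0]
f_u12 = binary_reduce(f_u20, f_e2012, f_u12, operator.mul, operator.add)
```

Here the size-1 edge feature is expanded to the width of the node feature before the multiply; operands of sizes 2 and 3, for instance, would raise an error.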
Figure 1: Example directed subgraph showing aggregation direction. For node v00, directed edges from all u2x to u1x
to v00 will be part of BR operation.
2.2 Copy-Reduce (CR)
DGL implements the CR operation separately from BR as it is widely used in GNN applications without any binary
operation. Therefore, we view it as a special class of BR. CR takes only one input operand associated with the source
(node or edge) and passes the feature vector as a message, i.e., copies it to the destination (node or edge), where it is
reduced onto the destination feature.
As shown in Equation 3, CR can be mathematically represented using BR syntax with y replaced with NULL or φ,
resulting in ⊗(x, y) becoming a unary operation copy(x):
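$$
\begin{aligned}
&\mathrm{BR}(x, \phi, \otimes, \oplus, z) : \oplus(\otimes(x, \phi), z) \\
&\otimes(x, \phi) = \mathrm{copy}(x) \\
&\mathrm{BR}(x, \phi, \otimes, \oplus, z) \Rightarrow \mathrm{CR}(x, \mathrm{copy}, \oplus, z) \\
&\mathrm{CR}(x, \mathrm{copy}, \oplus, z) : \oplus(\mathrm{copy}(x), z), \quad \forall x, z \in G(V, E)
\end{aligned}
\tag{3}
$$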
For example, in Figure 1, with sum (+) as reduction operation, CR between u10 and v00 can be expressed as:
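$$
\begin{aligned}
\mathrm{CR}_{21}(u_{10}, \mathrm{copy}, +, v_{00}) :\quad & t \leftarrow \mathrm{copy}(u_{10}) \\
& v_{00} \leftarrow v_{00} + t
\end{aligned}
\tag{4}
$$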
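The CR expression can be sketched in a few lines of Python. The dictionary-based feature storage, the function name copy_reduce and the second edge (u12 to v00) are assumptions made for illustration; only the u10 to v00 edge is taken from the example above.

```python
import operator

def copy_reduce(edges, src_feat, dst_feat, reduce_op=operator.add):
    """For every edge (u, v): t <- copy(F_u), then F_v <- reduce_op(F_v, t)."""
    out = {v: list(f) for v, f in dst_feat.items()}
    for u, v in edges:
        t = list(src_feat[u])                               # copy(x)
        out[v] = [reduce_op(z, m) for z, m in zip(out[v], t)]
    return out

# Hypothetical features; the edge u12 -> v00 is assumed for the example.
src_feat = {"u10": [1.0, 2.0], "u12": [3.0, 4.0]}
dst_feat = {"v00": [0.0, 0.0]}
edges = [("u10", "v00"), ("u12", "v00")]

summed = copy_reduce(edges, src_feat, dst_feat, operator.add)   # u_copy_add_v
maxed = copy_reduce(edges, src_feat, dst_feat, max)             # u_copy_max_v
```

Swapping the reduction operator is the only change needed to move between configurations such as add and max.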
2.3 Configurations of Binary-Reduce and Copy-Reduce
Various configurations of BR arise as a result of multiple candidates for each input operand and reduction destination.
Table 1 lists all the configurations of the BR and CR primitives implemented in DGL.
Table 1: Configurations of the BR and CR primitives built into DGL.
BR                       | CR
u_⊗_v_⊕_v,  v_⊗_u_⊕_v    | u_copy_⊕_v,  e_copy_⊕_v
u_⊗_v_⊕_e,  v_⊗_u_⊕_e    |
u_⊗_e_⊕_v,  e_⊗_u_⊕_v    |
u_⊗_e_⊕_e,  e_⊗_u_⊕_e    |
v_⊗_e_⊕_v,  e_⊗_v_⊕_v    |
v_⊗_e_⊕_e,  e_⊗_v_⊕_e    |
DGL has built-in support for a set of configurations in which ⊗ ∈ {add, sub,mul, div, dot} and ⊕ ∈
{add, max, min, mul, div, copy}. In practice, DGL showed that these configurations are enough to support a large
majority of applications. Our evaluation showed that even with these simple operations, the BR primitive accounts for a
majority of the run-time across various applications (described in Section 5). We profiled 7 GNN applications (total
8 instances) and the BR and CR primitives used by them (Table 2).
Table 2: GNN applications and the BR and CR configurations they use.
Application           | BR and CR configurations
1. GCN                | (u_copy_add_v)
2. GCN-Sampled        | (u_copy_add_v)
3. GraphSAGE          | (u_copy_add_v)
4. GraphSAGE-Sampled  | (u_copy_add_v)
5. GCMC               | (u_copy_add_v), (u_dot_v_add_e)
6. Line Graph         | (u_copy_add_v)
7. Monet              | (u_mul_e_add_v)
8. GAT                | (e_copy_add_v), (e_copy_max_v), (u_add_v_copy_e), (e_sub_v_copy_e), (e_div_v_copy_e), (u_mul_e_add_v), (v_mul_e_copy_e)
9. RGCN-Hetero        | (u_copy_add_v)
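The configuration names follow a regular pattern (operand, binary operator, operand, reduction operator, destination for BR; operand, copy, reduction operator, destination for CR). The Python sketch below decodes a name into its parts under that reading. The Config type and parse_config function are hypothetical helpers written for this illustration, not part of DGL.

```python
from typing import NamedTuple, Optional

class Config(NamedTuple):
    kind: str                   # "BR" or "CR"
    lhs: str                    # first operand: u, v or e
    binary_op: Optional[str]    # None for CR
    rhs: Optional[str]          # None for CR
    reduce_op: str
    dst: str                    # reduction destination: u, v or e

TARGETS = {"u", "v", "e"}
BINARY_OPS = {"add", "sub", "mul", "div", "dot"}
REDUCE_OPS = {"add", "max", "min", "mul", "div", "copy"}

def parse_config(name):
    """Decode names such as 'u_mul_e_add_v' (BR) or 'u_copy_add_v' (CR)."""
    parts = name.split("_")
    if len(parts) == 4 and parts[1] == "copy":
        lhs, _, reduce_op, dst = parts
        kind, binary_op, rhs = "CR", None, None
    elif len(parts) == 5:
        lhs, binary_op, rhs, reduce_op, dst = parts
        kind = "BR"
        if binary_op not in BINARY_OPS or rhs not in TARGETS:
            raise ValueError("unsupported BR configuration: " + name)
    else:
        raise ValueError("unrecognized configuration: " + name)
    if lhs not in TARGETS or dst not in TARGETS or reduce_op not in REDUCE_OPS:
        raise ValueError("unsupported configuration: " + name)
    return Config(kind, lhs, binary_op, rhs, reduce_op, dst)

monet = parse_config("u_mul_e_add_v")
gcn = parse_config("u_copy_add_v")
gat_edge = parse_config("u_add_v_copy_e")
```

Note that copy appears in two roles: as the unary operation in CR names and as the reduction operator in names such as u_add_v_copy_e, where the result is written to an edge.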
2.4 Baseline Implementations of BR and CR in DGL
The graph adjacency matrix in DGL is in Compressed Sparse Row (CSR) format. The CPU implementation first
loads the features of u, v and/or e, as required, for each row offset (representing the source node) and corresponding
column indices (representing destination nodes). Using these feature vectors, it performs BR or CR for the tuple
(u, v, e). Specifically, to execute CR, DGL implements a push model. By push, we mean that ⊕(copy(x_k), z_{k−1})
executes from hop k to hop k − 1, so x_k belongs to hop k and z_{k−1} to hop k − 1. To achieve good performance for CR on the CPU, the
implementation parallelizes the loop over rows of the CSR matrix (i.e., the source nodes, xk). Since CR is an integral
part of BR, we first focus on the problems associated with parallel execution in CR. As shown in Equation 1, the
binary operation ⊗ is straightforward, executes before ⊕ and can be parallelized easily, with the result stored in some
temporary feature t.
Algorithm 1 describes the baseline push model in DGL’s CR implementation.
Algorithm 1 Copy-Reduce: Push
for all source nodes u ∈ V in parallel do
    copy_u(u, out) {DGL function that copies the source feature vector to out}
    for all destination nodes v ∈ N(u) in serial do
        Fv ← Fv ⊕ out
    end for
end for
When different source nodes u share a neighbor v and the CR destination is node v, the CR operation results in a race
condition among the threads. DGL resolves the race by serializing the updates inside critical sections. This
serialization significantly degrades performance, leading to slower application run times.
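The structure of the push model and its critical section can be sketched in Python as follows. This is an illustration only: the function name, the adjacency-list input and the use of a single threading.Lock in place of DGL's critical section are assumptions, and Python threads are used to show the structure rather than to measure performance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def push_copy_reduce(out_neighbors, src_feat, num_dst, dim, workers=4):
    """Push model: threads own source nodes and scatter into shared destinations."""
    dst_feat = [[0.0] * dim for _ in range(num_dst)]
    lock = threading.Lock()              # stands in for the critical section

    def push(u):
        out = list(src_feat[u])          # copy_u: copy the source feature vector
        for v in out_neighbors[u]:
            with lock:                   # serialize updates to shared F_v
                row = dst_feat[v]
                for i in range(dim):
                    row[i] += out[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(push, range(len(out_neighbors))))
    return dst_feat

# Hypothetical graph: sources 0 and 2 point to destinations 0 and 1; source 1 to 0.
out_neighbors = [[0, 1], [0], [0, 1]]
src_feat = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pushed = push_copy_reduce(out_neighbors, src_feat, num_dst=2, dim=2)
```

Every update to a destination passes through the lock, which is exactly the serialization the text identifies as the bottleneck.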
Also, the push approach to CR is scatter-heavy given that the graph adjacency matrix is more than 99.9% sparse; this
bounds CR performance by memory access latency to randomly scattered destination node addresses in memory. We
also observed via profiling and analysis that there is a potential reuse proportional to the average node degree. However,
the push model fails to make use of this reuse because it simply scatters the feature vector to different addresses. This
results in a significant amount of wasted memory bandwidth.
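As a worked check of the sparsity and reuse figures, the short Python sketch below computes both quantities from the node and edge counts listed in Table 3. It assumes that every listed edge corresponds to one non-zero of an N × N adjacency matrix; the helper names are hypothetical.

```python
def sparsity(num_nodes, num_edges):
    """Fraction of zero entries in a num_nodes x num_nodes adjacency matrix."""
    return 1.0 - num_edges / (num_nodes * num_nodes)

def average_degree(num_nodes, num_edges):
    """Average number of edges per node, a proxy for feature-vector reuse."""
    return num_edges / num_nodes

# (nodes, edges) as listed in Table 3.
datasets = {
    "Pubmed": (19717, 44338),
    "Reddit": (232965, 11606919),
    "Amazon OGB-Products": (2449029, 123718280),
}

stats = {
    name: (sparsity(n, e), average_degree(n, e))
    for name, (n, e) in datasets.items()
}
```

Under this assumption all three adjacency matrices are more than 99.9% sparse, while the average degree ranges from roughly 2 for Pubmed to roughly 50 for Reddit and Amazon OGB-Products, which bounds how often a source feature vector could be reused.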
3 Optimizing Aggregation Primitives
BR and CR account for a majority of the run-time in GNN applications. In this section, we describe techniques we
have created to optimize their implementations within DGL. As discussed in Section 2.4, achieving high performance
for CR is critical to BR performance as well; it is also potentially harder to achieve. Therefore, we first focus on CR
optimizations.
3.1 Copy-Reduce
To avoid the problems associated with the default push model, DGL provides a way to pull messages from nodes u
and reduce them into nodes v. Now, parallelizing the CR operation by distributing v across OpenMP threads will not
result in collisions because exactly one thread owns the feature vector Fv of each destination node v and reduces
each pulled source feature vector into Fv (Algorithm 2).
Algorithm 2 Copy-Reduce: Pull
for all destination nodes v ∈ V in parallel do
    for all source nodes u ∈ N(v) in serial do
        copy_u(u, out) {DGL function that copies the source feature vector to out}
        Fv ← Fv ⊕ out
    end for
end for
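A minimal Python sketch of the pull model is shown below. The function name and the in-neighbor list input are assumptions made for illustration; the point is only that each destination has a single owner, so no lock is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def pull_copy_reduce(in_neighbors, src_feat, dim, workers=4):
    """Pull model: each task owns one destination node and gathers its sources."""
    def pull(v):
        acc = [0.0] * dim                 # private accumulator for F_v
        for u in in_neighbors[v]:         # serial loop over the sources of v
            f = src_feat[u]
            for i in range(dim):
                acc[i] += f[i]
        return acc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(pull, range(len(in_neighbors))))

# Hypothetical graph: destination 0 has sources 0, 1, 2; destination 1 has 0, 2.
in_neighbors = [[0, 1, 2], [0, 2]]
src_feat = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pulled = pull_copy_reduce(in_neighbors, src_feat, dim=2)
```

Because the accumulator is private to the task that owns v, the writes never collide; the reads of source features, however, are still scattered across memory.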
While Algorithm 2 solves the collision problem of Algorithm 1, it still does not solve either the feature vector reuse
problem (to reduce wasted memory bandwidth) or the memory latency problem due to random access pattern of source
addresses. It turns from a scatter-heavy algorithm to a gather-heavy one. To solve the problems associated with both
push and pull, we further optimize this algorithm and implement a variant of the Sparse-Dense Matrix Multiply
operation that Wang et al. [6] allude to.
The neighborhood graph is represented as a sparse matrix, the adjacency matrix in CSR format. The dense matrix
consists of the feature vectors Fu or Fe associated with source nodes u or edges e, respectively. Algorithm 3 shows the
details of our optimized CPU implementation of CR for the u_copy_add_v configuration.
Algorithm 3 Copy-Reduce: Pull Optimized
Require: A - Sparse matrix of size M × K in CSR format
Require: B - Dense matrix of size K × N
Require: C - Dense matrix of size M × N
Require: Reduction-operator: ⊕
Require: N = length(Fu), kb = block-size on K dimension, nb = block-size on N dimension
Require: C ← 0
for r ∈ 0, . . . , M − 1 in parallel do
    for c ∈ 0, . . . , K − 1, step kb do
        B[c, . . . , c + kb] ← RadixSort(B[c, . . . , c + kb])
        for n ∈ 0, . . . , N − 1, step nb do
            C[r][n] ← C[r][n] + B[c][n] {C[r] and B[c] are N-wide vectors}
        end for
    end for
end for
The critical part of this formulation is that the rows and columns of the sparse matrix A represent the destination (M)
and source (K) nodes, respectively and the dense matrix B consists of the source node feature vectors Fu. Thus, the
output matrix C consists of feature vectors of destination nodes Fv, v ∈ Neighborhood(u), reduced from multiple
source nodes Fu. Given that A is an adjacency matrix in CSR format, each row (i.e., destination node v) contains only the
column indices of its connected source nodes u. In effect, the matrix multiply selects the rows of B (the feature vectors Fu
of source nodes u) that reduce into each row of C (the feature vector Fv of destination node v).
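The sparse-dense formulation for u_copy_add_v can be sketched in plain Python as below. The CSR arrays are named indptr and indices by common convention; these names and the function name are assumptions, and the sketch omits the blocking and sorting that the optimized implementation adds.

```python
def csr_copy_add(indptr, indices, B):
    """C = A @ B for a 0/1 sparse matrix A stored in CSR as (indptr, indices)."""
    num_rows = len(indptr) - 1
    width = len(B[0])
    C = [[0.0] * width for _ in range(num_rows)]
    for r in range(num_rows):                       # destination node v
        for k in range(indptr[r], indptr[r + 1]):
            c = indices[k]                          # connected source node u
            for j in range(width):
                C[r][j] += B[c][j]                  # F_v += F_u
    return C

# A is 2 x 3: row 0 has non-zeros in columns 0, 1, 2; row 1 in columns 0, 2.
indptr = [0, 3, 5]
indices = [0, 1, 2, 0, 2]
B = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
C = csr_copy_add(indptr, indices, B)
```

Each row of C is the sum of exactly those rows of B whose column index appears in the corresponding row of A.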
To achieve high performance, Algorithm 3 contains two primary optimizations:
1. Parallelizes over rows of A (and C). This optimization is similar to that in Algorithm 2, where threads own
destination nodes, and thus, there is no collision problem that we observe in Algorithm 1.
2. Takes advantage of the reuse present in the graph, and avoids random gathers by:
(a) Blocking the K dimension of A and B, ensuring that all threads work on one block of kb source nodes at
a time,
(b) Sorting the block of rows in B according to row-id using Radix Sort, and
(c) Blocking the N dimension of B and C to process nb feature vector elements at a time.
Algorithm 4 Binary-Reduce: (Node, Node, Any)
Require: Matrix A of size M × K in CSR format
Require: Matrix E of size M2 × K in CSR format (Incidence)
Require: Matrix ET of size K × M2 in CSR format (Incidence)
Require: Feature matrix Vf of size M × d
Require: Feature matrix Ef of size M2 × d
Require: Input operands: X (Nodes), Y (Nodes)
Require: Output operand: Z (Any)
Require: Binary-operator: ⊗, Reduction-operator: ⊕
1: for u ∈ 0, . . . , M − 1 in parallel do
2:     Fu ← Vf [u]
3:     for v in A[u] do
4:         Fv ← Vf [v]
5:         if Z = U then
6:             Vf [u] ← Fu ⊕ (Fu ⊗ Fv) {Reduction Destination: source nodes u}
7:         else if Z = V then
8:             Vf [v] ← Fv ⊕ (Fu ⊗ Fv) {Reduction Destination: destination nodes v}
9:         else if Z = E then
10:             e ← ET [v]
11:             Ef [e] ← Fu ⊗ Fv {Copy Destination: Edges}
12:         end if
13:     end for
14: end for
Due to 2(a), any feature vector in B read by some thread t could be in the L2 cache of the CPU if/when some other
thread t′ reads the same feature vector. Due to 2(b), accesses of source node feature vectors from DRAM are not
completely random, but in ascending order of addresses, which should help reduce DRAM access latency. Due to 2(c),
all threads work only on a block of C of size M × nb at a time, where nb is the block size. We use a value of nb such
that the block of C stays in the Last Level Cache (LLC) of the CPU until it is completely processed.
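One way to read optimizations 2(a) to 2(c) is sketched below in plain Python. The loop order (feature block outermost, then source-node block, then rows), the per-row cursor and the use of sorted() in place of Radix Sort are assumptions made for this illustration; the sketch shows the access pattern only and is not the authors' implementation.

```python
def blocked_copy_add(indptr, indices, B, kb, nb):
    """Blocked C = A @ B for a 0/1 CSR matrix A, with kb x nb tiling."""
    M, K, N = len(indptr) - 1, len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # 2(b): sort the source ids of every row (sorted() stands in for radix sort).
    rows = [sorted(indices[indptr[r]:indptr[r + 1]]) for r in range(M)]
    for n0 in range(0, N, nb):                  # 2(c): block of nb feature columns
        n1 = min(n0 + nb, N)
        cursor = [0] * M                        # next unread source id per row
        for k0 in range(0, K, kb):              # 2(a): block of kb source nodes
            k1 = k0 + kb
            for r in range(M):                  # rows would be split across threads
                cols, p = rows[r], cursor[r]
                while p < len(cols) and cols[p] < k1:
                    src = B[cols[p]]
                    for j in range(n0, n1):
                        C[r][j] += src[j]
                    p += 1
                cursor[r] = p
    return C

# Hypothetical 3 x 4 adjacency matrix with unsorted column indices.
indptr = [0, 3, 5, 6]
indices = [2, 0, 1, 3, 0, 2]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
C_blocked = blocked_copy_add(indptr, indices, B, kb=2, nb=2)
```

Within one (n0, k0) tile, all rows touch only kb rows of B and nb columns of C, which is the locality that 2(a) and 2(c) aim for; the result is independent of the chosen block sizes.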
3.2 Binary-Reduce
We focus now on optimizing the binary operation within BR, applying the optimized Algorithm 3 to handle the CR part.
Algorithms 4, 5 and 6 describe the optimized BR for different configurations of input and output operands.
Our optimizations consist of three major steps.
1. Of the two input operands, gather the features of the second operand corresponding to each instance of the first
operand, as required by the binary operation.
2. Perform the element-wise binary operation (⊗) on the two operands.
3. Reduce the dense matrix generated using ⊕. If the reduction destination is a node, then apply CR on the node
feature matrix. If the reduction destination is an edge, copy the result of Step 2 to the dense edge feature
matrix.
To clarify the usage of various BR configurations, we have shown three algorithms: (Node, Node, Any), (Node, Edge,
Any) and (Edge, Node, Any) in Algorithms 4, 5 and 6, respectively.
In Algorithm 4, for each source node u, we load feature Fu and gather connected destination node features Fv (line 4).
Depending on whether the final destination of the reduction or copy is u, v or the edge e between them, lines 6, 8 and 11
scatter the result Fu ⊗ Fv to the node-feature matrix Vf (lines 6 and 8) or the edge-feature matrix Ef (line 11).
In Algorithm 5, the second operand is the edge incident on u; therefore, we must first obtain the edge index e from the
incidence matrix ET , gather its feature Fe from edge-feature matrix Ef and then perform ⊗ followed by reduction or
copy on lines 7, 9, or 11, respectively, corresponding to the final destination.
In Algorithm 6, the first operand is the set of all edges E; in line 2, we load each edge-feature Fe; the second operand is
the set of nodes V upon which e is incident; therefore, in line 4, we gather node-features Fu∀u ∈ V ; again, depending
on the final destination, we reduce or copy Fe ⊗ Fu to Vf or Ef in lines 6, 8 and 10, respectively.
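The three steps (gather, element-wise ⊗, reduce or copy) can be sketched for an arbitrary operand combination as follows. This Python sketch iterates over an edge list whose position serves as the edge id, starts node reductions from zero, and assumes node and edge features have equal width; these choices and all names are assumptions for illustration rather than the CSR-based implementation described above.

```python
import operator

def select(target, u, v, e, node_feat, edge_feat):
    """Pick the feature of the source node, destination node or edge."""
    if target == "u":
        return node_feat[u]
    if target == "v":
        return node_feat[v]
    return edge_feat[e]

def binary_reduce(edges, node_feat, edge_feat, lhs, rhs, dst,
                  binary_op=operator.mul, reduce_op=operator.add):
    dim = len(node_feat[0])
    if dst == "e":
        out = [None] * len(edges)
    else:
        out = [[0.0] * dim for _ in node_feat]
    for e, (u, v) in enumerate(edges):
        x = select(lhs, u, v, e, node_feat, edge_feat)   # step 1: gather operands
        y = select(rhs, u, v, e, node_feat, edge_feat)
        t = [binary_op(a, b) for a, b in zip(x, y)]      # step 2: element-wise op
        if dst == "e":
            out[e] = t                                   # step 3: copy to the edge
        else:
            n = u if dst == "u" else v
            out[n] = [reduce_op(z, m) for z, m in zip(out[n], t)]  # step 3: reduce
    return out

edges = [(0, 2), (1, 2)]                       # edge id = position in the list
node_feat = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
edge_feat = [[2.0, 2.0], [0.5, 0.5]]

u_mul_e_add_v = binary_reduce(edges, node_feat, edge_feat, "u", "e", "v")
u_add_v_copy_e = binary_reduce(edges, node_feat, edge_feat, "u", "v", "e",
                               binary_op=operator.add)
```

The zero initial value is only appropriate for an add reduction; other reducers such as max or min would need a matching identity value.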
Algorithm 5 Binary-Reduce: (Node, Edge, Any)
Require: Matrix A of size M × K in CSR format
Require: Matrix E of size M2 × K in CSR format (Incidence)
Require: Matrix ET of size K × M2 in CSR format (Incidence)
Require: Feature matrix Vf of size M × d
Require: Feature matrix Ef of size M2 × d
Require: Input operands: X (Nodes), Y (Edges)
Require: Output operand: Z (Any)
Require: Binary-operator: ⊗, Reduction-operator: ⊕
1: for u ∈ 0, . . . , M − 1 in parallel do
2:     Fu ← Vf [u]
3:     for v in A[u] do
4:         e ← ET [v]
5:         Fe ← Ef [e]
6:         if Z = U then
7:             Vf [u] ← Fu ⊕ (Fu ⊗ Fe) {Reduction Destination: source nodes u}
8:         else if Z = V then
9:             Vf [v] ← Vf [v] ⊕ (Fu ⊗ Fe) {Reduction Destination: destination nodes v}
10:         else if Z = E then
11:             Ef [e] ← Fu ⊗ Fe {Copy Destination: Edges}
12:         end if
13:     end for
14: end for
Algorithm 6 Binary-Reduce: (Edge, Node, Any)
Require: Matrix A of size M × K in CSR format
Require: Matrix E of size M2 × K in CSR format (Incidence)
Require: Matrix ET of size K × M2 in CSR format (Incidence)
Require: Feature matrix Vf of size M × d
Require: Feature matrix Ef of size M2 × d
Require: Input operands: X (Edges), Y (Nodes)
Require: Output operand: Z (Any)
Require: Binary-operator: ⊗, Reduction-operator: ⊕
1: for e ∈ 0, . . . , M2 − 1 in parallel do
2:     Fe ← Ef [e]
3:     for u in E[e] do
4:         Fu ← Vf [u]
5:         if Z = U then
6:             Vf [u] ← Fu ⊕ (Fe ⊗ Fu) {Reduction Destination: source nodes u}
7:         else if Z = V then
8:             Vf [v] ← Vf [v] ⊕ (Fe ⊗ Fu) {Reduction Destination: destination nodes v}
9:         else if Z = E then
10:             Ef [e] ← Fe ⊗ Fu {Copy Destination: Edges}
11:         end if
12:     end for
13: end for
As can be seen, Algorithm 3 is critical for the performance of both BR and CR operations. The algorithm is designed
and optimized for small input matrices, which usually occur in applications that sample and batch the input graph for
processing. However, the algorithm is not yet fully optimized for large input matrices, which usually occur in
applications that process the full graph in non-batched mode. Thus, for applications with full-graph processing, we use
the MKL matrix multiplication kernel mkl_sparse_?_mm().
4 PyTorch Primitives
We used PyTorch as the backend to execute DGL and the neural network functions, e.g., Linear layer. Our application
profiles indicated that a number of PyTorch primitives execute sub-optimally on the CPU. Of these, BatchNorm1d and
Embedding accounted for a significant amount of run-time in the Line Graph Neural Network (LGNN) application.
BatchNorm1d did not have an implementation within MKLDNN for PyTorch; therefore, we created an optimized
version in a PyTorch extension by parallelizing across the samples and vectorizing across features per sample. The
Embedding primitive in PyTorch is similar to Copy-Reduce in terms of operations: gather a set of feature vectors using
index vectors and copy them into destination vectors in the Forward pass; scatter-reduce the gradients of Embedding
weights in the Backward pass.
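The gather and scatter-reduce pattern of Embedding can be sketched in plain Python as below. The function names and list-based storage are assumptions made for illustration; this is neither PyTorch's implementation nor the optimized extension described above.

```python
def embedding_forward(weight, indices):
    """Forward pass: gather one copy of a weight row per index."""
    return [list(weight[i]) for i in indices]

def embedding_backward(grad_out, indices, num_rows):
    """Backward pass: sum output gradients into the rows they were gathered from."""
    dim = len(grad_out[0])
    grad_weight = [[0.0] * dim for _ in range(num_rows)]
    for g, i in zip(grad_out, indices):
        for j in range(dim):
            grad_weight[i][j] += g[j]     # rows indexed more than once accumulate
    return grad_weight

weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
indices = [2, 0, 2]                        # row 2 is looked up twice
out = embedding_forward(weight, indices)
grad_out = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grad_weight = embedding_backward(grad_out, indices, num_rows=len(weight))
```

The backward pass has the same collision structure as the push model of Copy-Reduce: several output rows may reduce into the same weight row.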
5 Results
In this section, we demonstrate the performance benefits of optimized aggregation and other primitives in various GNN
applications implemented in DGL.
5.1 Applications
We analyzed and optimized seven applications that are implemented using DGL and available within the DGL Github
repository https://github.com/dmlc/dgl/. We briefly discuss these applications.
• GCN [2] is a semi-supervised learning approach on graph-structured data that applies the notion of convolutions
on graphs. In each layer, it applies linear transforms to regularized node features and normalizes them before
aggregation.
• GraphSAGE [1] is a general inductive framework that uses node features to generate node embeddings for data
unseen by the network. For each node u, it aggregates neighbor v features Fv and concatenates the aggregated
Fv to Fu before applying a linear transform.
• Relational GCN (R-GCN) [5] is a GNN that applies the GCN framework to relational graphs. For each node u,
it first aggregates linearly transformed neighbor feature Fv under relation r with Fu and then aggregates them
across all relations r ∈ R.
• Line Graph Neural Network (LGNN) [7] is an instance of a GNN that employs both node feature aggregation
as well as edge-feature aggregation. Thus, there are two sequential aggregation steps that make this application
particularly suitable for our optimization.
• MoNet [8] is a general framework for applying GCN to replace previous methods of learning on non-Euclidean
spaces, such as Geodesic CNN and Anisotropic CNN. In the DGL implementation, the core aggregation step
is u_mul_e_add_v (transformed node features multiplied by Gaussian weights on the edges) followed by a sum,
mean or max operation on the resulting feature vectors.
• Graph Convolutional Matrix Completion (GC-MC) [9] is a graph-based auto-encoder framework for matrix
completion that uses GCN for recommender systems. In the DGL implementation, the aggregation operation
is copy_u(u, out) followed by sum reduction.
• Graph Attention Networks (GAT) [3] leverage masked self-attentional layers by stacking layers in which nodes
attend over their neighborhoods’ features. In this paper, we analyze GAT performance as applied to life-sciences
applications such as molecular property prediction.
5.2 Experimental Evaluation
5.2.1 Experimental Setup
We performed all the experiments on an Intel® Xeon® 8280 CPU @ 2.70 GHz with 28 cores (single socket), equipped
with 98 GB of memory per socket. The peak bandwidth to DRAM on this machine is 128 GB/s. We used the gcc v7.1.0
compiler to build DGL and the backend PyTorch neural network framework from source code.
We used the latest release of DGL, v0.4.3, to demonstrate the performance enhancements due to our optimizations. We
used PyTorch v1.6.0-rc1 as the backend for all our experiments. All the applications execute with default parameter
settings. We used the PyTorch autograd profiler to profile the applications.
Table 3 shows the details of the datasets used in our experiments. Additionally, we used MovieLens-1M (ML-1M)
dataset for GC-MC application and a synthetic dataset built using stochastic block model (SBM) for LGNN application.
ML-1M is a benchmark dataset based on the user ratings for the movies; it consists of 6,040 users, 3,706 movies,
1,000,209 ratings with rating levels 1, 2, . . . , 5. SBM is a synthetic dataset generated from a random graph model with
planted clusters. We used the default input parameters to generate the dataset.
Figure 2: Training performance of the optimized CPU code against the CPU baseline in DGL over seven different
GNN applications processing the full graph in non-batched mode. The speedup of the optimized code is shown on
top of the optimized bar. Misc. is the run-time of all the remaining components. The performance numbers
are averaged over 10 epochs, except for LGNN, where we used 3 epochs. The datasets used for the experiments are
named in the charts. Here, the BR primitive represents the time for both BR and CR.
Figure 3: Training performance of the optimized CPU code against the CPU baseline for the GraphSAGE application
with sampled graph processing on the Amazon OGB-Products and Reddit datasets. The speedup of the optimized code is
shown on top of the optimized bar. Misc. is the run-time of all the remaining components. The performance
numbers are averaged over 10 epochs. Here, the BR primitive represents the time for both BR and CR.
5.2.2 Performance Evaluation of DGL
We compared the performance of optimized DGL against the baseline (i.e., non-optimized) DGL. We ran all seven
applications with non-batched (full graph) processing; we also ran GraphSAGE with batched (sampled)
graph processing (Figure 2 and Figure 3). We used the largest of the benchmark datasets provided in
DGL for these applications. For GraphSAGE, we also show performance results for a larger dataset, the Amazon
ogb-products dataset from https://ogb.stanford.edu/docs/nodeprop/.
Table 3: Benchmark graph datasets.
Dataset              | #nodes    | #edges      | #features       | #classes
Pubmed               | 19,717    | 44,338      | 500             | 3
Reddit               | 232,965   | 11,606,919  | 602             | 41
Amazon OGB-Products  | 2,449,029 | 123,718,280 | 100             | 47
BGS                  | 44,333    | 227,916     | 103 (Relations) | 2
Overall, for applications with non-batched processing, we observed a speedup of 1.6× to 12.8× in per-epoch time over
the baseline DGL on the CPU; specifically, we observe a BR speedup of 1.72× to 34× per epoch compared
to the DGL baseline across the seven applications (Figure 2). Similarly, for GraphSAGE with batched processing,
we see an overall speedup of 1.5× to 1.7× per epoch over the DGL baseline; specifically, we observe a BR speedup of
7.2× to 10.6× per epoch over the DGL baseline (Figure 3). All our optimizations preserve the accuracy of the baseline
DGL.
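The gap between the primitive-level and per-epoch speedups is what Amdahl's law predicts when only part of the run-time is accelerated. The Python sketch below states the relation; the run-time fraction and speedup used in the example are hypothetical values chosen for illustration, not measurements from this paper.

```python
def overall_speedup(fraction, primitive_speedup):
    """Amdahl's law: speedup of the whole epoch when only `fraction` of the
    baseline run-time is accelerated by `primitive_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / primitive_speedup)

def implied_fraction(overall, primitive_speedup):
    """Invert Amdahl's law: baseline run-time share of the accelerated part."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / primitive_speedup)

# Hypothetical example: a primitive taking 40% of the epoch, made 10x faster.
example = overall_speedup(0.4, 10.0)
recovered = implied_fraction(example, 10.0)
```

The relation holds only under the assumption that the rest of the epoch is unchanged, so it should be read as a rough guide rather than a decomposition of the measured results.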
Our optimizations of the BatchNorm1d and Embedding PyTorch primitives (in the LGNN application) resulted in speedups of
13× and 76×, respectively. With these three optimized primitives, the optimized LGNN achieves a 2× speedup over the baseline.
The Misc. portion of the run-times in Figure 2 is dominated by other primitives from the PyTorch framework,
plus some DGL framework overheads. These PyTorch primitives can be optimized along the same lines as the 1D Native Batch
Norm and Embedding primitives.
6 Conclusions
Aggregation operations are critical to the functionality of Graph Neural Network applications. Via extensive application
profiling and analysis of their implementations in the popular DGL, we observed that aggregation primitives account for
a majority of the run-time across applications. The Binary-Reduce abstraction in DGL is the main aggregate operation.
It is a memory-intensive operation with element-wise operations being the only compute; therefore, on CPU, the
performance of this primitive is bound by the available memory-bandwidth. We optimized the sparse-dense matrix
multiplication formulation of binary-reduce (and its special case, copy-reduce). We have demonstrated the benefits of
the optimizations across a range of GNN applications in DGL.
References
[1] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In
Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page
1025–1035, Red Hook, NY, USA, 2017. Curran Associates Inc.
[2] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In
Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017.
[3] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph
attention networks. In International Conference on Learning Representations, 2018.
[4] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In
International Conference on Learning Representations, 2019.
[5] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling
relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC
2018, Proceedings, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), pages 593–607. Springer/Verlag, 2018.
[6] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma,
Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J Smola, and Zheng Zhang.
Deep graph library: Towards efficient and scalable deep learning on graphs. ICLR Workshop on Representation
Learning on Graphs and Manifolds, 2019.
[7] Zhengdao Chen, Lisha Li, and Joan Bruna. Supervised community detection with line graph neural networks. In
International Conference on Learning Representations, 2019.
[8] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein.
Geometric deep learning on graphs and manifolds using mixture model cnns. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[9] Rianne van den Berg, Thomas N. Kipf, and Max Welling. Graph convolutional matrix completion, 2017.
