Graph Convolutional Networks (GCNs) have emerged as the stateof-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a light-weight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA. To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform.
INTRODUCTION
Recently, representation learning on graphs has attracted much attention. By extracting structured, low dimensional features from the unstructured, high dimensional graph, many downstream tasks such as node classification [14, 18] , link prediction [38] , graph classification [32] and clustering [29] can be performed easily and Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. FPGA '20, February 23-25, 2020, Seaside, CA, USA © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7099-8/20/02. . . $15.00 https://doi.org /10.1145/3373087.3375312 effectively. Among the numerous representation learning methods, Graph Convolutional Networks (GCNs) [18] are capable of learning significantly better features than traditional methods [12, 24] . GCNs have been used to facilitate user recommendation for social networks [31] , to identify protein functionality from interaction graphs [14] and to help circuit testability analysis for EDA [20] .
Despite the popularity of GCNs, training is still expensive in terms of time and computation resources. To scale GCNs to larger graphs and to enable fast re-training on dynamic graphs, it is critical to develop accelerators. Existing works [14, 34] parallelize training on GPU and multi-core platform. How to accelerate training by exploiting the FPGA hardware is currently not well studied.
Similar to Convolutional Neural Network (CNNs), GCNs are built by iteratively stacking multiple layers. Operations of a graph convolutional layer are decomposed into two major steps, to (1) propagate information within the graph and (2) propagate information along the neural network layers. Correspondingly, FPGA accelerators need to address the below challenges to improve performance:
Memory access. The feature propagation (step 1) in a large and sparse graph incurs high volume of irregular memory accesses, both on-chip and off-chip. The memory challenge is unique to the GCN training problem. While CNN accelerators [19, 25, 30, 33, 35, 37, 39] achieve high data reuse and regular accesses by tiling and sliding windows on input tensors, such tensor operations do not generalize to sparse graphs. In addition, while graph processing accelerators [7, 8, 16, 36, 41] optimize data flow and layout for propagation of node labels, such optimizations hardly lead to performance gain when the labels are replaced with long feature vectors.
Computation. The propagation along GCN layers (step 2) involves computationally intensive dense tensor operations on the model weights and graph node features. Fast computation of this step require high consumption of hardware resources.
Load-balance. Degree-imbalance of graph nodes can significantly degrade the performance of feature propagation. Consequently, the overall layer computation (steps 1 and 2) can be load-imbalanced. It is challenging to design on-chip computation modules and the corresponding scheduler to ensure load-balance for arbitrary graphs.
We propose GraphACT, a framework to accelerate GCN training by addressing the three challenges. The main contributions are:
• We propose GraphACT with the following optimizations: -Algorithm selection: By analyzing various GCN training algorithms, we select a subgraph-based minibatch algorithm to significantly reduce CPU-FPGA communication. -Redundancy reduction: For each subgraph as a minibatch, we identify and eliminate the recurring aggregation operations between common node neighbors. Such online pre-processing on CPU significantly reduces the number of on-chip operations and BRAM accesses on FPGA. -Parallelization: We parallelize the key training steps with optimized on-chip computation modules. Integrating the modules into the processing pipeline, we achieve load-balance on a wide range of target FPGA devices. • We develop an accurate performance model for GraphACT.
The model identifies architectural parameters of the FPGA design, and algorithmic parameters of the minibatch sampler. • We evaluate GraphACT on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. We achieve an order of magnitude training time speedup with negligible accuracy loss, compared with state-of-the-art multi-core implementation.
BACKGROUND AND RELATED WORK
We use bold capital letters (e.g., X ) to denote matrices (zero indexed).
We use X u , X :,v and X u,v to denote a row, a column and an element of X , respectively. We use X [a : b, :] to retrieve a sub-matrix by extracting rows of X (from the a th to the (b −1) th row). Similarly, we use X [:, a : b] to retrieve a sub-matrix by extracting the columns. We use superscript T to denote matrix transpose. Superscript within brackets (·) denote index of a GCN layer.
Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are built by extending the convolution operation defined on grid matrices to unstructured graphs [18] . The input to a GCN is an un-directed, node attributed graph G (V, E, X ), where V and E denote the set of all nodes and edges. Each node is attributed by a length-f feature vector, so X ∈ R | V |×f denotes the feature matrix of V. A GCN embeds each graph node into a f ′ -dimensional vector. So the output of a GCN is simply an embedding matrix X ′ ∈ R | V |×f ′ , generated by information from both the node feature and the node connections.
A GCN consists of multiple graph convolutional layers. Figure  1 visualizes layer (ℓ + 1), where 0 ≤ ℓ ≤ L − 1. The layer has |V | input nodes, each attributed by a length-f (ℓ) vector X (ℓ) u T . Note, for the first layer (i.e., ℓ = 0), X (0) = X , and f (0) = f . For the last layer, X (L) = X ′ and f (L) = f ′ . The layer performs the following operations to obtain the length-f (ℓ+1) output vectors X (ℓ+1) u T : Feature aggregation. As shown in Figure 1 , each node u sends its feature X (ℓ) u T via two types of connection: the self-connection (blue) and the neighbor-connection (green). Neighbor connection is defined by the edges E. The features propagated to the same destination via neighbor-connection are aggregated by vector mean.
Use node 0 as an example. Its input is X 
Weight transformation. Each node independently transforms its two vectors output from the feature aggregation step. In Figure  1 , we denote the transform function as h (·), parameterized by the GCN weights. Transformation function for u and v are identical. Equation 1 defines the forward path of layer ℓ + 1 . D is the (diagonal) degree matrix, where D u,u equals degree of node u, and D u,v equals zero for u v. A is the adjacency matrix of G, where A u,v is 1 if (u, v) ∈ E, and 0 otherwise. Each layer has two weight matrices: the self-weight W (ℓ+1) • and the neighbor-weight W (ℓ+1) ⋆ . The "|" operation concatenates two matrices column-wise. The feature aggregation operation corresponds to D −1 · A · X (ℓ) , and the transformation function h (·) applies weights W
concatenates the intermediate results, and applies ReLU activation.
The output embedding X ′ (or X (L) ) of a GCN is often used in downstream applications such as node classification [5, 6, 14] . To classify nodes, we feed X ′ to a Multi-Layer Perceptron (MLP). A softmax layer [11] converts the MLP outputs to node labels.
Such node classification procedure enables supervised learning of W (ℓ)
⋆ . Suppose each training node v ∈ V is provided with a ground truth label. During training, we calculate cross-entropy loss L to measure difference between the ground truth and the generated labels. We minimize the loss by updating W (ℓ)
using gradient descent. Gradients with respect to layer ℓ parameters can be calculated from gradients of layer ℓ + 1 by chain rule:
:, 0 :
:,
where function mask (·) corresponds to gradient of ReLU (·). Both mask and ReLU are light-weight elementwise functions on matrices. For the node classifier, forward pass of a MLP layer is simply X out MLP = ReLU W MLP · X in MLP , where X in MLP and X out MLP are the input and output features for all V, and W MLP is the layer weight. Backward pass of a MLP layer performs computation similar to Equations 2a and 2c. For the softmax layer, its forward pass involves computation of exponential function, and its backward pass under cross-entropy loss only involves matrix subtraction [1] .
To train GCNs on large graphs, minibatches need to be constructed. Unlike the case of CNN training where images are independent and identically distributed, training samples of GCNs (i.e., graph nodes) depends on each other due to edges. Thus, formulating GCN minibatches is challenging. There have been various techniques to sample minibatches [4-6, 14, 15, 34] . Section 3.1 analyzes the feasibility of hardware implementation for these techniques.
Deep Learning Training Accelerators
Various FPGA-based accelerators [10, 13, 23, 28, 40] have been proposed to train CNNs. The work in [40] proposes a modular design based on the types of layer operations, and improves performance 
One GCN layer based on G To accelerate GCN training, the work in [34] proposes parallelization techniques for multi-core platform. It partitions features to increase cache-hit of each core. Although GCNs are an extension of CNNs to graphs, challenges to accelerate GCNs are significantly different. GCNs require both sparse and dense matrix operations. Apart from intensive computation, GCN accelerators need to address issues such as irregular memory access and load-balance.
Graph Analytics Accelerators
Many accelerator designs [7, 8, 16, 36, 41] have been proposed for traditional graph analytic problems, such as PageRank (PR), single source shortest path (SSSP) and breadth first search (BFS). The works in [7, 41] process input graphs in a two-phased gather-scatter manner, and utilize graph partitioning to improve access locality. The work in [8] extends the above memory optimization to multi-FPGA platforms, and the works in [16, 36] propose optimization specific to HBM and HMC to further boost FPGA performance.
The memory optimizations proposed in the above works do not directly apply to GCN accelerators. First of all, traditional graph analytics often propagate scalars along the graph edges, while GCNs propagate long feature vectors. Thus, although in both cases memory accesses are irregular, the access patterns are very different. Secondly, traditional graph analytics often propagate information within the full graph, while GCNs propagate within minibatches. Thus, techniques such as graph partitioning may not be effective since minibatch size is much smaller than the full graph size.
TRAINING ALGORITHM
We observe from the forward pass (Equation 1) and the backward pass (Equation 2), that GCN training involves three types of matrix product P ·Q: (1) P being dense and Q being dense, (2) P being binary sparse and Q being dense, and (3) P being diagonal and Q being dense. From acceleration perspective, type (1) operation is computationally intensive and thus requires massive parallelization of DSPs (Section 4.3); type (2) operation incurs large number of irregular BRAM accesses and motivates the graph topology based optimizations (Section 3.2); type (3) operation is equivalent to scaling each row of Q by the diagonal elements of P, and thus its contribution to the total training cost is negligible. Other operations (e.g., mask, ReLU, concatenation, sub-matrix extraction) are light-weight and straightforward to implement, and we do not discuss them in detail.
Algorithm Selection
For efficient training on large graphs, the first step is to reduce external memory communication via minibatches. Ideally, a minibatch should be sampled such that all data required for gradient calculation (e.g., X (ℓ) of the minibatch nodes) fits in BRAM. Among the numerous algorithms [5, 6, 14, 15, 34] , some return minibatches not suitable for hardware execution. We categorize these algorithms and analyze the hardware cost on external memory accesses.
Minibatch by sampling GCN layers [5, 6, 14, 15] . Sampling algorithms in this category traverse GCN layers backward from layer-L outputs to layer-1 inputs to select minibatch nodes. Assume b nodes of layer ℓ + 1 are selected. The sampler then take α · b nodes of layer ℓ based on the inter-layer connections and/or the features of the d · b neighbors in layer-ℓ (where d is the average node degree). Specifically, the sampling of [14] does not depend on neighbor features and 10 ≤ α ≤ 50 in general. [6] uses the neighbor features as "historical activation" and α = 2. [15] computes the node probability by the neighbor features and α = 1 on average. [5] does not require neighbor features and α = 1. In summary, suppose we sample b 0 output nodes of layer L. Then the sampler reads the neighbor features of each layer (if required) to eventually return α L · b 0 input nodes in layer 1. During training, we read features of the α L · b 0 input nodes of layer 1, and then perform forward propagation to compute output features of the layer-ℓ sampled nodes (1 ≤ ℓ ≤ L).
Minibatch by sampling training graph [34] . The algorithm samples from G instead of the GCN layers. Given a graph sampling algorithm (e.g., multi-dimensional random walk [26] ), [34] 
The minibatch contains the same set of nodes V s for all the GCN layers. In other words, for each minibatch, [34] constructs a complete L-layer GCN from G s , with the layer nodes defined by V s and the layer connections defined by E s . Note that unlike [6, 15] , the graph sampler of [34] requires no information from node features.
Suppose we were to implement the above minibatch training algorithms on FPGA. Since the full X (ℓ) is too large to fit on-chip, we need to read from external memory the following data: (1) layer-1 input features of the minibatch nodes, and (2) layer-ℓ input features of the minibatch neighbor nodes (if required). For ease of analysis, assume the feature sizes of each layer are the same, and let the cost of transferring one feature vector be 1. Also, let the cost of aggregation and computing h (·) for one node (see Figure 1 ) be 1. Ignore the computation cost of feature aggregation. Table 1 summarizes the ratio between on-chip computation cost and off-chip communication cost, where for the "Value" row, we set L = 2, d = 15 and α as 25, 1, 2, 1 for [14] , [5] , [6] , [15] respectively. Algorithms of low computation-communication ratio (i.e., [6, 14, 15] ) impede the development of efficient hardware accelerators. For the remaining two algorithms, the complexity of the sampler of [34] is much lower than that of [5] , and the training accuracy of [34] is higher. Thus, we select [34] as the target training algorithm for acceleration. 
Value 0.04 2 0.06 0.06 2
Remark on notation. Since we select the algorithm of [34] , in the following sections, we mainly focus on GCN built on a subgraph G s . A symbol with an "s" subscript means it is related with a subgraph. Thus, forward and backward pass of the GCN can be updated by simply replacing A, D, X (·) with A s , D s , X (·) s in Equations 1 and 2.
Redundancy Reduction
Before designing the hardware architecture, we first optimize the redundancy in minibatch training from an algorithm perspective. Take Figure 2a as an example. Suppose the graph sampler returns a subgraph G s (of G in Figure 1 ) as the minibatch. In each GCN layer, nodes 1, 2, 3 and 4 aggregate neighbor features, with node 0 calcu-
(omitting superscript (ℓ) for simplicity). This corresponds to 6 vector additions and 10 vector reads in total. Observe that the vector pair X T 0 , X T 2 appears in the aggregation of both nodes 1 and 3. Similarly, the pair X T 1 , X T 3 is aggregated by both nodes 0 and 2. Thus, we can perform pre-processing to compute the partial sum
The aggregation of the four nodes after preprocessing requires only 2 additions and 4 reads. In general, even considering the pre-processing cost, we can still significantly reduce both computation and communication of feature aggregation. The redundancy reduction helps us achieve perfect load-balance of the complete pipeline for a wide range of FPGAs (Section 5.3).
The key to reduce redundancy is to identify sets of nodes appearing frequently in the neighbor lists of v ∈ V s . To simplify the problem, we aim at finding such sets of size 2 -we identify common pairs of neighbors. We first construct an un-directed aggregation graph G a from G s by Algorithm 1. Each edge (u, v) in G a represents potential vector sum operations between u and v. The attribute of (u, w) consists of a set of nodes {v 1 , . . . , v n }, meaning that the neighbor list of v i (1 ≤ i ≤ n) contains both u and w. The weight of an edge is simply the size of its attribute set (we remove edges of weight 1). In Algorithm 1, W a and D a can be implemented as a hash table. And W a [(u, w)] (or D a [(u, w)]) returns the value corresponding to the key (u, w). Figure 2b shows the aggregation graph G a for our above example, where edges have weight 2.
▷ Key-value pairs mapping edges to edge weights 3: D a ← ∅ ▷ Key-value pairs mapping edges to edge attributes 4: for v ∈ V a do 5:
if Key (u, w) not in W a then 8:
Add key-value pair ((u, w) , 1) to W a 9:
if Key (u, w) not in D a then 12:
Add key-value pair ((u, w) , {v}) to D a 13: 15: Remove key-value pairs of weight-1 edges in W a and D a 16: G a ← Un-directed graph based on W a and D a Intuitively, the next step upon getting G a is to extract the edges with largest weights, so that pre-computation of the corresponding vector sums reduces the most redundancy. However, there is one subtlety to notice. Suppose two edges of G a , u i , u j and u j , u k , have large weights, and {v 1 , . . . , v m }, the intersection of their attributes, is non-empty. Consider the aggregation of nodes
By replacing the pair X T i , X T j with x ′ , the other pair X T j , X T k disappears in aggregation. So x ′′ does not help reduce redundancy.
To avoid such useless pre-computation on x ′′ , a good solution is to find a maximum weight matching of G a , so that the selected pairs imply high redundancy, and share no common nodes. Proof. We first consider the reduction in number of reads. Since edges in a matching are disjoint, for each (u, v) ∈ M * a , accessing
] number of reads. The pre-computation of X T u + X T v consumes 2 reads. In sum, the total reduction is (W a [e] − 2). The proof for reduction in number of additions can be similarly derived. □ Note that Theorem 3.1 considers the total cost including any precomputation cost. Also, although polynomial complexity algorithm [22] exists for computing the maximum weight matching, it is still too expensive in practice. Thus, we propose a greedy approach (Algorithm 2) to quickly compute a good matching M a . 
The next step after obtaining M a is to update the original G s to G ′ s (following Algorithm 3), so that the pre-computed sums can propagate in G ′ s . The idea is to merge each pair of nodes in M a into a new node, whose feature vector is returned by precomputation. Compared with G s , the updated G ′ s has more nodes ( V ′ s = |V s | + |M a |), but less edges. Note that G ′ s is directed, even when G s is un-directed. The example G ′ s is shown in Figure 2c . 
Complexity analysis. Complexity of Algorithms 1, 2 and 3 are low compared with the feature aggregation. Complexity of Algorithm
where d max and d are the max and average degree of G s . Complexity of Algorithm 2 is O (|E a | + N log |E a |), where the first term counts for line 2, and the second term is for the loop from line 4 to 9.
Number of times max-edge is extracted from heap, N , depends on the threshold θ . Typically, N ≪ |E a |, |V s | < 5000, d < 20. Overall complexity of Algorithms 1 and 2 is much less than the complexity of single layer feature aggregation (i.e., O (|E s | · f ), where the feature length f is to the order of 10 2 to 10 3 ). For Algorithm 3, lines 6 and 7 each takes at most d w operations if we simply scan the neighbor list of w. In return, we save one vector sum operation for u and v. Considering d ≪ f , overhead of Algorithm 3 is thus negligible compared with the benefit in redundancy reduction. It is worth noticing that (1) one time transformation from G s to G ′ s benefits 3L number of feature aggregation in a L-layer GCN, and (2) such transformation, which only involves integer operations, reduces floating point arithmetics during feature aggregation. These two observations further justify the cost of Algorithms 1, 2 and 3.
Iterative redundancy reduction. After obtaining G ′ s , redundancy still exists when aggregating features of G ′ s . Viewing G ′ s as the new G s , we can again apply Algorithms 1, 2 and 3 to obtain a G ′′ s . This process can continue until few edges can be reduced (i.e., the matching is small). Note that although the training graph (and thus G s ) is often un-directed, Algorithms 1, 2 and 3 still apply when G s is directed. Therefore, iterative redundancy reduction is feasible.
For sake of notation, define one round of redundancy reduction as one invocation of Algorithms 1, 2 and 3. 
ACCELERATOR DESIGN 4.1 Overview
We first partition the workload between FPGA and CPU. We let FPGA execute the computation intensive operations, and leave the communication intensive parts to CPU. A natural partition is: CPU performs graph sampling, and then converts G s to G # s ; FPGA performs the forward and backward pass defined by Equations 1 and 2, where feature propagation of each layer is based on G # s . Notice that under supervised learning, the last graph convolutional layer is followed by a node classifier (see Section 2.1). The softmax layer at the end of the classifier and the cross-entropy loss require computation of exponential and logarithmic functions. Since softmax and loss contribute to a negligible portion of the total workload and their accurate calculation requires complicated hardware [9, 21] , we assign their computation to CPU. Section 4.4 describes the scheduling.
To improve overall training throughput, we need to (1) reduce the overhead in external memory access, and (2) increase the utilization of the on-chip resources. The first goal can be achieved by setting the minibatch parameters so that the size of G s is appropriate for the BRAM capacity. Ideally, once receiving the initial node features (i.e., X Also note that the optimizer specific auxiliary data depends on the optimizer used. For vanilla gradient descent, no auxiliary data is needed. For gradient descent with momentum [27] , we need to store the gradient with respect to model weights for the previous minibatch. For Adam optimizer [17] , we need to store more gradient related data (e.g., first moment estimate, second moment estimate, etc. ). No matter what optimizer is used, the size of the auxiliary data is comparable to the size of model weights.
Regarding the processing pipeline on-chip, there are two main computation modules to perform feature aggregation and weight transformation. In Figure 3 , the Feature Aggregation module consists of a 1D accumulator array to sum the node vectors. The Weight Transformation module consists of a 2D systolic array to compute the dense matrix product between X and W . The two modules are reused for the computation of the L graph convolutional layers, and the Weight Transformation module is also reused for the MLP layer. The various BRAM buffers stores the data listed above. During forward pass, when computing layer ℓ, the Feature Aggregation and Weight Transformation modules read from the buffer of X 
Feature Aggregation Module
This module performs feature aggregation in three steps. The first pre-computation step calculates the vector sum of node pairs in M a , and stores the results in the buffer for X M . The second step computes A # s · X into a temporary buffer to be consumed by the Weight Transformation module. For the below discussion, we ignore step 3 since its complexity is low (i.e., O (|V s |)) compared with the other steps (i.e., O (|E s |)). Regarding steps 1 and 2, since the feature length is large, we explore parallelism along the feature dimension. In this case, a simple 1D accumulator array of size P agg = max 0≤ℓ ≤L−1 f (ℓ) is sufficient. During pre-computation, pairs of node indices are read sequentially from M a . Vector sum of each pair consumes two cycles. Similarly, during propagation of A # s , indices in the neighbor lists are read sequentially. Note that even though M a contains pairs of multiple rounds (see last paragraph of Section 3.2), as long as the pre-computation on the earlier rounds finishes before that of later rounds, the memory controller for the accumulator array is straightforward. Remark. To further increase parallelism, we can aggregate vectors of multiple neighbors in the same cycle. Then the accumulator array would be replaced by an accumulator tree, and the buffer would be further partitioned to reduce bank conflicts in BRAM. In this case, challenges such as load-balance would emerge. Fortunately, for most target FPGA devices, parallelism of just max f (ℓ) is sufficient. We discuss this in detail in Section 5.3 and 5.4. Also, we evaluate the storage overhead due to X M in Section 6.2.
Weight Transformation Module
This module performs weight transformation of either a GCN layer (i.e., h (·) function in Figure 1 ) or a MLP layer. The main operation is multiplication of dense matrices. We use a 2D systolic array to execute the blocked matrix multiplication algorithm. Let the dimension of the systolic array be P sys (where P sys ≪ f (ℓ) ). So the computation parallelism of this module is P 2 sys , and each cycle 2P sys data are streamed in the array. Next we specify the data access pattern. Suppose we compute X · W , where X ∈ R | V s |×f and W ∈ R f ×f ′ . Then the X buffer is depth-wise partitioned to tiles of layout P sys × f , and the W buffer is width-wise partitioned to tiles of layout f × P sys . There are Controller feature from the buffer filled by the Feature Aggregation module. When operating on the self weight W • , conflicts may occur since both modules read from the X buffer. We add a small tile buffer in the Weight Transformation module to completely avoid read conflicts. The tile buffer of size P sys × f stores a tile of X , and is filled in f cycles. Data in the tile buffer stay for f + P sys − 1 · f ′ P sys cycles to enumerate all tiles of W . Read conflicts can only happen during the filling of the tile buffer. And we simply stall the Feature Aggregation module in this short period of time. Since f ′ ≫ P sys , the pipeline stall has negligible impact on performance. The above analysis is based on the forward pass operation. In the backward pass, we swap the dimensions of |V s | and f , and the same conclusion holds.
Optimizer data

Scheduling
Scheduling between CPU and FPGA. CPU samples the subgraph G s , transforms it to G # s , and calculates softmax, loss and the corresponding gradients. These are shown in light blue blocks of Figure  4 . FPGA handles majority of the computation workload in the forward and backward pass, as shown in light green. Communication between CPU and FPGA, as specified in Section 4.1, is in dark blue. Notice that the subgraphs are independently sampled for each minibatch. Thus, CPU can prepare G s and G # s for the next minibatch, while simultaneously FPGA is training the current one. This explains the overlap between step 7 and 2. Parallelism can be explored by multiple cores of the CPU. The C cores can process subgraphs for the next C minibatches in parallel without any dependency.
Scheduling of FPGA modules. In Figure 4 , we only show the scheduling of the GCN forward pass. Scheduling of MLP and the backward pass can be analogously designed. Computation on neighbor weights depends on the aggregated feature and computation on self weights has no dependency. Thus, we overlap the operations of a and b. The avoidance of read conflicts between a and b is discussed in Section 4.3. During operation c, Feature Aggregation module is idle. We analyze its impact on DSP utilization in Section 5.4.
PERFORMANCE ANALYSIS
To simplify notation, we assume f (ℓ) = f , ∀0 ≤ ℓ ≤ L. So the weight matrices W
Session: Deep Learning II FPGA '20, February 23-25, 2020, Seaside, CA, USA 1 2 factor is due to the concatenation in the forward pass. For the classifier, we assume a single layer MLP, with W MLP ∈ R f ×f . Regarding FPGA resources, we assume accumulators and multipliers are implemented by DSPs and have the same hardware cost. We assume the target FPGA can implement R DSP number of accumulators / multipliers, store R BRAM words, and communicate R BW words with external memory per cycle. Here a word represents an element of the feature vectors or the weight matrices.
Training performance depends on the parameters related to:
• Minibatch and GCN: |V s |, d and f • Redundancy reduction: γ add , γ read and M ∈M a |M | • FPGA architecture: P agg , P sys
We also use the fact that d ≪ f ≪ |V s | to simplify analysis.
Computation
Each graph convolutional layer performs feature aggregation 3 times and product on weights 6 times. Two of the feature aggregation operate on length-f vectors, and the other one on length-1 2 f vectors. Since feature aggregation is parallelized along feature dimension only, it takes exactly γ read · |V s | · d · f /P agg cycles to aggregate length-f features. All six products on weights have the same complexity, and each takes 1 2 · |V s | · f 2 /P 2 sys cycles. By the schedule in Figure 4 , to hide the feature aggregation time, we have:
where factor 1 − 2P sys f is due to the pipeline stall analyzed in Section 4.3. Solving Equations 3a and 3b under the constraint of 3c gives the architectural parameters P * agg , P * sys . Under reasonable values of f , |V s |, d, γ read and R DSP , the solutions P * arr and P * sys always exist (see also Section 5.3). Total FPGA cycles * to complete one minibatch is:
Communication
On-chip storage. All data listed in Section 4.1 have to fit on-chip. Index data A # s and M a and coefficient D are negligible compared with feature data (since d ≪ f ). Size of the buffer for X M is f · M ∈M a |M |. Size of the buffer between Feature Aggregation and Weight Transformation modules is f · |V s |. Size of the tile buffer in Weight Transformation module is P sys · |V s |. Weights for each layer takes f 2 . Under gradient descent with momentum, the optimizer requires additional f 2 storage per layer for the auxiliary data. Thus,
5) * We ignore the number of cycles to apply gradients to weights (elementwise operation), since it is negligible compared to the number of cycles to calculate gradients.
By adjusting θ and number of rounds (see Algorithm 2 and Section 3.2), we control |M | to meet a pre-defined budget (say |V s |). Then, we tune the graph sampler so that |V s | satisfies inequality 5.
Off-chip accesses. Ignoring D s , A # s and M a , CPU-FPGA data transfer include the initial features and the MLP output features and gradients. Thus, β, the ratio between on-chip computation and off-chip communication, is lower bounded † by f , indicating high reuse of on-chip data, and low overhead in external memory access.
Load-Balance
If feature aggregation is parallelized along feature dimension only, the full FPGA pipeline is perfectly load-balanced, regardless of the graph connection. However, if we would have to aggregate features of multiple nodes in parallel, BRAM access conflicts would cause load-imbalance of the module and degrade training performance.
Load-imbalance is more likely to happen on FPGAs with more DSP resources. The thresholdR DSP for a design to become loadimbalanced can be derived by plugging P agg = f into Equations 3a and 3b. After simplifying the expression, we have:
Feature dimension is often set to 256, 512 or 1024 in the literature. Subgraph degree is often less than 20, so we set d = 15. The ratio γ read depends on G s , and we set it to 0.7. With such parameter values,R DSP > 4048, meaning that there have to be at least 4048 multipliers / accumulators on chip for our FPGA pipeline to become load-imbalanced. Even the largest FPGAs in the market (e.g., Xilinx UltraScale+, Intel Stratix-10) does not have such high DSP capacity (considering that data in single precision floating point are necessary to achieve high training accuracy). Note that γ read helps keep the thresholdR DSP high. If γ read = 1,R DSP is only 3038.
DSP Utilization
Since load-balance is achieved and BRAM conflicts are eliminated, we can analytically derive the DSP utilization for any G s . From the timeline of Figure 4 , the systolic array is idle when CPU is communicating with FPGA (ignoring the CPU computation time on softmax and loss). The accumulator array of Feature Aggregation module is idle during the stall and during computation on neighbor weights. Using the fact that computation-communication ratio β > f (see Section 5.2), we derive the overall DSP utilization µ as:
where µ ′ = 1 R DSP · 2 P * 
EXPERIMENTS
We evaluate on a Xilinx Alveo U200 board hosted by a 40-core Xeon server (E5-2698 v4 @2.2GHz, hyper-threaded). CPU-FPGA communication is via PCIe 3.0 x16 (peak bandwidth: 15.8GB/s). The UltraScale+ XCU200 FPGA on Alveo has 5867 DSP slices, 1766 36Kb block RAMs and 800 288Kb Ultra RAMs ‡ . For Float32, on-chip RAM can store 8166K words, and DSPs can implement 1173 accumulators or 1955 multipliers. We use Vivado 2018.3 for synthesis. We use three standard, large benchmark graphs [5, 6, 14, 15, 34 ] to evaluate the training speed and accuracy. In Table 2 , "Attribute" specifies the initial feature length (i.e., f (0) = X (0) v ). "Classes" specifies total number of node classes. "Train/Val/Test" splitting follows [34] . For all the datasets, following [5, 6, 14, 15, 34] , the GCN has two graph convolutional layers (L = 2) and one MLP layer in the classifier. The hidden dimension is f (ℓ) = 256, where ℓ = 1, 2.
We compare with the baseline [34] since GraphACT uses the same minibatch algorithm as [34] . For [34] , we run the public implementation § by Python (version 3.6.7) and Tensorflow (version 1.12). We run [34] on both the CPU and GPU platforms. Table 4 shows the hardware specification of the hardware. Parallelization of [34] on CPU is via the Intel MKL library. Thus, we add the --mkl flag to run [34] on CPU. Parallelization of [34] on GPU is directly via Tensorflow. The implementation of GraphACT is open-sourced .
Training Accuracy
Under the same hyper-parameters, the accuracy and convergence of GraphACT are identical [34] . However, since we set the subgraph size under the BRAM constraint (Equation 5), it is necessary to evaluate accuracy under our settings. Table 3 summarizes the comparison, where except the subgraph size, all hyper-parameters of Para-GCN and GraphACT are the same. The popular GraphSAGE does not sample subgraphs, and we use its default minibatch size of 512. Comparing with [14] , we achieve much higher accuracy on PPI, and comparable accuracy on Reddit and Yelp after similar number of epochs. Comparing with [34] , reducing the subgraph size |V s | has negligible impact on the overall accuracy and convergence rate.
Redundancy Reduction
After transforming G s to G # s , redundancy in computation and communication are reduced at the cost of extra on-chip buffer to store ‡ In this paper, we refer to block RAM and Ultra RAM by a general term "BRAM". § Baseline code: https://github.com/ZimpleX/gcn-ipdps19, commit a6f531.
GraphACT code: https://github.com/GraphSAINT/GraphACT the vector sums of M a . Figure 5 shows such tradeoff, as well as the effectiveness of redundancy reduction. Each "×" marker corresponds to one round of redundancy reduction, and we run for 5 rounds (θ = 2, Algorithm 2). We observe (1) number of reads and additions can be reduced to as low as 60%, at the cost of a buffer less than 2 |V s |, (2) the γ read value used in Section 5.3 and 5.4 (to analyze load-balance and DSP utilization) is realistic, and (3) amount of redundancy depends on the graph topology -while our algorithm is effective on many graphs (e.g., PPI, Yelp), exceptions exist (e.g., Reddit). By Figure 5 , in our implementation, we set a budget B such that M ∈M a |M | < B · |V s |. For PPI, Reddit, Yelp, B = 2, 0.5, 0.5. Table 4 shows the training speed comparison with the state-of-theart [34] . All implementations use single precision floating point for weights and features. In our design, the DSP and BRAM resources are heavily utilized. We set P sys = 24 and P arr = 128 for all datasets. For the multi-core implementation, "L3" shows the aggregated L3cache size and "Off-chip bandwidth" shows the peak CPU-main memory data transfer speed. For the GPU implementation, "Offchip bandwidth" is the same as the proposed implementation since the GPU is connected via PCIe 3.0 x16. "Total convergence time" includes the time for graph sampling, pre-processing (for redundancy reduction), GCN forward/backward propagation and data transfer from/to main memory. It excludes the time for initial data loading from the disk and Tensorflow initialization (for allocating device resources and building computation graphs).
Comparing with the CPU baseline, we achieve 12× to 15× speedup. Apart from the overhead of the Python language, inefficiency of the CPU baseline is mainly due subgraph feature aggregation operation. We observe that although feature aggregation requires less than 10% of the multiplication/addition of weight transformation, the CPU spends about equal amount of time on the two operations (see also Figure 3 .D of [34] ). Comparing with the GPU baseline, our design convergences slightly faster (1.1× to 1.5×). The theoretical peak performance of the GPU (9.3 TFLOPS for Float32 [2]) is much higher than that of the FPGA (1.8 TFLOPS for Float32 [3] ∥ ). Inefficiency of the GPU baseline is mainly due to the sub-optimal Tensorflow implementation of sparse matrix multiplication. Note that the CPU in our proposed design only executes light weight operations (Section 3.2 and 4.4). Thus, the training time improvement mainly comes from the highly optimized FPGA architecture. ∥ Each Float32 multiplier consumes 3 DSPs. The max DSP frequency is 775MHz. Although the baseline training time may be further improved by re-implementation using C++/CUDA, inefficiency in CPU and GPU due to sparse feature aggregation may not be easily eliminated.
DISCUSSION
This work proposed several hardware-software optimizations for GCN training on CPU-FPGA heterogeneous platforms. We discuss these optimizations and their applicability to various platforms.
Design challenges. The key to accelerating GCN training is to address the challenges of memory access and load-balance. The solutions for these differ on various platforms. Regarding memory access, the solution on FPGA must optimize both on-chip and offchip accesses. For data in BRAMs, we need to increase their reuse so as to reduce off-chip communication. We also need to reduce bank access conflicts to reduce pipeline stalls. In GraphACT, we reduce off-chip communication by setting the subgraph size based on the BRAM capacity (Section 5.2). We eliminate on-chip access conflicts by appropriately parallelizing the feature aggregation operation (Section 4.2) and buffering data tiles between computation modules (Section 4.3). On the other hand, for CPU and GPU, the critical issue is to enhance data access locality since the memory hierarchy is not explicitly exposed to the programmer. Considering that the full graph fits in the CPU main memory or GPU global memory (sizes of which range from 10 1 to 10 3 GB), the memory access challenge on CPU and GPU may be addressed by software nodereordering or by hyper-threading. Regarding load-balance, CPU and GPU can decouple the operations of feature propagation and weight transformation, and then adopt separate strategies to balance them. However, on FPGA, since the two operations are processed by a single pipeline, an integrated strategy needs to simultaneously balance the very different operations (one on dense tensors and the other on sparse tensors). While memory access and load-balance are challenging to optimize, a carefully designed FPGA pipeline can achieve much higher efficiency than the CPU or GPU solutions. This is due to the ability of FPGA to customize computation modules and to control memory accesses in an explicit and fine-grained manner.
Remark on redundancy reduction. Although redundancy reduction is an algorithm-level optimization independent of the platform, it may not be directly applicable to CPU to improve performance. Recall that redundancy reduction (Section 3.2) constructs a subgraph G # s with less edges yet more nodes than G s . Thus, feature aggregation on G # s may have worse data locality than G s . On a CPU, even though the L3-cache can hold the node features of G # s , the overall feature aggregation performance may not benefit from the reduced computation load due to the potential increase in L1-or L2cache misses. On the other hand, data access locality of GraphACT does not affect FPGA performance, as long as data of one minibatch completely fits in BRAM. Under the proposed architecture and data layout (Section 4), BRAM will always provide the requested node feature of G # s to the Feature Aggregation module within one cycle.
