94 research outputs found
A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives
Graph-related applications have experienced significant growth in academia
and industry, driven by the powerful representation capabilities of graphs.
However, efficiently executing these applications faces various challenges,
such as load imbalance, random memory access, etc. To address these challenges,
researchers have proposed various acceleration systems, including software
frameworks and hardware accelerators, all of which incorporate graph
pre-processing (GPP). GPP serves as a preparatory step before the formal
execution of applications, involving techniques such as sampling, reordering, etc.
However, GPP execution often remains overlooked, as the primary focus is
directed towards enhancing graph applications themselves. This oversight is
concerning, especially considering the explosive growth of real-world graph
data, where GPP becomes essential and even dominates system running overhead.
Furthermore, GPP methods exhibit significant variations across devices and
applications due to high customization. Unfortunately, no comprehensive work
systematically summarizes GPP. To address this gap and foster a better
understanding of GPP, we present a comprehensive survey dedicated to this area.
We propose a double-level taxonomy of GPP, considering both algorithmic and
hardware perspectives. By listing relevant works, we illustrate our
taxonomy and conduct a thorough analysis and summary of diverse GPP techniques.
Lastly, we discuss challenges in GPP and potential future directions.
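As a concrete illustration of one common GPP technique mentioned above, the sketch below shows degree-based vertex reordering (a Python/NumPy illustration of the general idea; the function and the edge-list input format are our own assumptions, not taken from any surveyed system):

```python
import numpy as np

def degree_reorder(edges, num_vertices):
    """Relabel vertices by descending degree, a common GPP reordering
    heuristic: high-degree (hub) vertices receive nearby IDs, which tends
    to improve cache locality when their neighbor lists are accessed."""
    degree = np.zeros(num_vertices, dtype=np.int64)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sort vertex IDs by descending degree; map old IDs to new IDs.
    order = np.argsort(-degree)
    new_id = np.empty(num_vertices, dtype=np.int64)
    new_id[order] = np.arange(num_vertices)
    return [(new_id[u], new_id[v]) for u, v in edges], new_id

edges = [(0, 3), (1, 3), (2, 3), (0, 1)]
remapped, new_id = degree_reorder(edges, 4)
# vertex 3 has the highest degree, so it is relabeled to ID 0
```

Real pre-processing pipelines operate on compressed formats such as CSR and combine reordering with partitioning or sampling, but the locality intuition is the same.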
GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep
learning model for representation learning on graphs. It is challenging to
accelerate training of GCNs, due to (1) substantial and irregular data
communication to propagate information within the graph, and (2) intensive
computation to propagate information along the neural network layers. To
address these challenges, we design a novel accelerator for training GCNs on
CPU-FPGA heterogeneous systems, by incorporating multiple
algorithm-architecture co-optimizations. We first analyze the computation and
communication characteristics of various GCN training algorithms, and select a
subgraph-based algorithm that is well suited for hardware execution. To
optimize the feature propagation within subgraphs, we propose a lightweight
pre-processing step based on a graph theoretic approach. Such pre-processing
performed on the CPU significantly reduces the memory access requirements and
the computation to be performed on the FPGA. To accelerate the weight update in
GCN layers, we propose a systolic array based design for efficient
parallelization. We integrate the above optimizations into a complete hardware
pipeline, and analyze its load-balance and resource utilization by accurate
performance modeling. We evaluate our design on a Xilinx Alveo U200 board
hosted by a 40-core Xeon server. On three large graphs, we achieve an order of
magnitude training speedup with negligible accuracy loss, compared with a
state-of-the-art implementation on a multi-core platform.
Comment: Published in ACM/SIGDA FPGA '2
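The redundancy-reduction flavor of such pre-processing can be sketched roughly as follows (a simplified Python/NumPy illustration of the idea, not GraphACT's actual matching algorithm): a neighbor pair shared by many aggregation lists has its feature sum computed once on the CPU and reused, so the accelerator performs fewer redundant additions.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def aggregate_with_shared_pair(adj_lists, features):
    """Find the neighbor pair shared by the most aggregation lists,
    compute its feature sum once, and reuse it wherever the pair
    appears -- trading one pre-processing pass for fewer redundant
    additions during neighbor-feature aggregation."""
    pair_count = Counter()
    for neigh in adj_lists:
        pair_count.update(combinations(sorted(neigh), 2))
    (a, b), _ = pair_count.most_common(1)[0]
    shared = features[a] + features[b]        # computed once, reused below
    out = []
    for neigh in adj_lists:
        if a in neigh and b in neigh:
            rest = [v for v in neigh if v not in (a, b)]
            out.append(shared + features[rest].sum(axis=0))
        else:
            out.append(features[list(neigh)].sum(axis=0))
    return np.stack(out)
```

The paper's pre-processing finds many such pairs via a graph-theoretic matching over the sampled subgraph; this sketch extracts only the single most frequent pair to keep the idea visible.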
GNNHLS: Evaluating Graph Neural Network Inference via High-Level Synthesis
With the ever-growing popularity of Graph Neural Networks (GNNs), efficient
GNN inference is gaining tremendous attention. Field-Programmable Gate Arrays
(FPGAs) are a promising execution platform due to their fine-grained
parallelism, low-power consumption, reconfigurability, and concurrent
execution. Even better, High-Level Synthesis (HLS) tools bridge the gap between
the non-trivial FPGA development efforts and rapid emergence of new GNN models.
In this paper, we propose GNNHLS, an open-source framework to comprehensively
evaluate GNN inference acceleration on FPGAs via HLS, containing a software
stack for data generation and baseline deployment, and FPGA implementations of
6 well-tuned GNN HLS kernels. We evaluate GNNHLS on 4 graph datasets with
distinct topologies and scales. The results show that GNNHLS achieves up to
50.8x speedup and 423x energy reduction relative to the CPU baselines. Compared
with the GPU baselines, GNNHLS achieves up to 5.16x speedup and 74.5x energy
reduction.
Computing graph neural networks: A survey from algorithms to accelerators
Graph Neural Networks (GNNs) have exploded onto the machine learning scene in recent years owing to their capability to model and learn from graph-structured data. Such an ability has strong implications in a wide variety of fields whose data are inherently relational, and for which conventional neural networks do not perform well. Indeed, as recent reviews attest, research in the area of GNNs has grown rapidly and has led to the development of a variety of GNN algorithm variants, as well as to the exploration of ground-breaking applications in chemistry, neurology, electronics, and communication networks, among others. At the current stage of research, however, the efficient processing of GNNs is still an open challenge for several reasons. Besides their novelty, GNNs are hard to compute due to their dependence on the input graph, their combination of dense and very sparse operations, and the need to scale to huge graphs in some applications. In this context, this article aims to make two main contributions. On the one hand, a review of the field of GNNs is presented from the perspective of computing. This includes a brief tutorial on GNN fundamentals, an overview of the evolution of the field in the last decade, and a summary of operations carried out in the multiple phases of different GNN algorithm variants. On the other hand, an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled.
This work is possible thanks to funding from the European Union's Horizon 2020 research and innovation programme under Grant No. 863337 (WiPLASH project) and the Spanish Ministry of Economy and Competitiveness under contract TEC2017-90034-C2-1-R (ALLIANCE project), which receives funding from FEDER.
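The dense/sparse operation mix the survey highlights can be made concrete with a minimal message-passing layer (a NumPy sketch under our own simplifications; real systems store the adjacency in a sparse format such as CSR and execute the aggregation as an SpMM):

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One simplified GCN layer: neighbor aggregation over the
    (mean-normalized) adjacency -- the sparse, graph-dependent step --
    followed by a dense feature transformation and ReLU -- the regular,
    GEMM-like step. The contrast between the two is what makes GNNs
    hard to accelerate with a single hardware structure."""
    deg = adj.sum(axis=1, keepdims=True)
    agg = (adj / np.maximum(deg, 1)) @ h   # sparse-style mean aggregation
    return np.maximum(agg @ w, 0.0)        # dense transform + ReLU
```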
GNNBuilder: An Automated Framework for Generic Graph Neural Network Accelerator Generation, Simulation, and Optimization
There are plenty of graph neural network (GNN) accelerators being proposed.
However, they rely heavily on users' hardware expertise and are usually
optimized for one specific GNN model, making them challenging for practical
use. Therefore, in this work, we propose GNNBuilder, the first automated, generic,
end-to-end GNN accelerator generation framework. It features four advantages:
(1) GNNBuilder can automatically generate GNN accelerators for a wide range of
GNN models arbitrarily defined by users; (2) GNNBuilder takes standard PyTorch
programming interface, introducing zero overhead for algorithm developers; (3)
GNNBuilder supports end-to-end code generation, simulation, accelerator
optimization, and hardware deployment, realizing a push-button fashion for GNN
accelerator design; (4) GNNBuilder is equipped with accurate performance models
of its generated accelerator, enabling fast and flexible design space
exploration (DSE). In the experiments, first, we show that our accelerator
performance model has errors within for latency prediction and
for BRAM count prediction. Second, we show that our generated accelerators can
outperform CPU by and GPU by . This framework is
open-source, and the code is available at
https://anonymous.4open.science/r/gnn-builder-83B4/.
Comment: 10 pages, 7 figures, 4 tables, 3 listings
NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference
The inherent diversity of computation types within individual deep neural
network (DNN) models necessitates a corresponding variety of computation units
within hardware processors, leading to a significant constraint on computation
efficiency during neural network execution. In this study, we introduce
NeuralMatrix, a framework that transforms the computation of entire DNNs into
linear matrix operations, effectively enabling their execution with one
general-purpose matrix multiplication (GEMM) accelerator. By surmounting the
constraints posed by the diverse computation types required by individual
network models, this approach provides both generality, allowing a wide range
of DNN models to be executed using a single GEMM accelerator, and
application-specific levels of acceleration without extra special function
units, as validated on mainstream DNNs and their variant models.
Comment: 12 pages, 4 figures. Submitted to the 11th International Conference
on Learning Representations
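One way to read the claim above: nonlinear operators can be replaced by piecewise-linear approximations whose evaluation is pure multiply-add, and hence expressible in GEMM-friendly form. Below is a hedged sketch of that general idea (our own illustration, not the paper's actual mapping; GELU and the segment count are arbitrary choices):

```python
import numpy as np

def piecewise_linear_gelu(x, num_segments=64, lo=-4.0, hi=4.0):
    """Approximate GELU with precomputed linear segments so evaluation
    reduces to a table lookup plus one multiply-add per element --
    operations a general-purpose GEMM unit can carry out, unlike the
    transcendental functions in the exact activation."""
    xs = np.linspace(lo, hi, num_segments + 1)
    gelu = lambda t: 0.5 * t * (1 + np.tanh(np.sqrt(2 / np.pi)
                                            * (t + 0.044715 * t ** 3)))
    ys = gelu(xs)                                     # precomputed offline
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])  # per-segment slope
    idx = np.clip(((x - lo) / (hi - lo) * num_segments).astype(int),
                  0, num_segments - 1)
    return ys[idx] + slopes[idx] * (x - xs[idx])      # a*x + b per element
```

With 64 segments over [-4, 4] the approximation is already close to the exact activation in that range; the accuracy/segment-count trade-off is the kind of knob such a transformation exposes.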
HiHGNN: Accelerating HGNNs through Parallelism and Data Reusability Exploitation
Heterogeneous graph neural networks (HGNNs) have emerged as powerful
algorithms for processing heterogeneous graphs (HetGs), widely used in many
critical fields. To capture both structural and semantic information in HetGs,
HGNNs first aggregate the neighboring feature vectors for each vertex in each
semantic graph and then fuse the aggregated results across all semantic graphs
for each vertex. Unfortunately, existing graph neural network accelerators are
ill-suited to accelerate HGNNs. This is because they fail to efficiently tackle
the specific execution patterns and exploit the high-degree parallelism as well
as data reusability inside and across the processing of semantic graphs in
HGNNs.
In this work, we first quantitatively characterize a set of representative
HGNN models on GPU to disclose the execution bound of each stage,
inter-semantic-graph parallelism, and inter-semantic-graph data reusability in
HGNNs. Guided by our findings, we propose a high-performance HGNN accelerator,
HiHGNN, to alleviate the execution bound and exploit the newfound parallelism
and data reusability in HGNNs. Specifically, we first propose a bound-aware
stage-fusion methodology that tailors to HGNN acceleration, to fuse and
pipeline the execution stages being aware of their execution bounds. Second, we
design an independency-aware parallel execution design to exploit the
inter-semantic-graph parallelism. Finally, we present a similarity-aware
execution scheduling to exploit the inter-semantic-graph data reusability.
Compared to the state-of-the-art software framework running on NVIDIA GPU T4
and GPU A100, HiHGNN respectively achieves an average 41.5x and
8.6x speedup as well as 106x and 73x better energy efficiency,
with a quarter of the memory bandwidth of GPU A100.
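The aggregate-then-fuse execution pattern this abstract describes can be sketched as follows (a NumPy illustration under our own simplifications: mean aggregation per semantic graph and a fixed weighted fusion, where a real HGNN would use learned attention weights):

```python
import numpy as np

def hgnn_forward(semantic_adjs, h, fuse_w):
    """Aggregate neighbor features independently within each semantic
    graph (the loop iterations are independent -- the source of the
    inter-semantic-graph parallelism noted above), then fuse the
    per-semantic results for every vertex."""
    per_semantic = []
    for adj in semantic_adjs:                  # one pass per semantic graph
        deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
        per_semantic.append((adj / deg) @ h)   # mean neighbor aggregation
    stacked = np.stack(per_semantic)           # [num_semantics, V, d]
    # Weighted fusion across semantic graphs (stand-in for attention).
    return np.tensordot(fuse_w, stacked, axes=1)
```

Because each semantic graph is processed from the same vertex features, consecutive iterations also touch overlapping data, which is the data-reusability opportunity the accelerator exploits.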