GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep
learning model for representation learning on graphs. It is challenging to
accelerate training of GCNs, due to (1) substantial and irregular data
communication to propagate information within the graph, and (2) intensive
computation to propagate information along the neural network layers. To
address these challenges, we design a novel accelerator for training GCNs on
CPU-FPGA heterogeneous systems, by incorporating multiple
algorithm-architecture co-optimizations. We first analyze the computation and
communication characteristics of various GCN training algorithms, and select a
subgraph-based algorithm that is well suited for hardware execution. To
optimize the feature propagation within subgraphs, we propose a lightweight
pre-processing step based on a graph-theoretic approach. This pre-processing,
performed on the CPU, significantly reduces the memory accesses and the
computation to be performed on the FPGA. To accelerate the weight updates in
GCN layers, we propose a systolic-array-based design for efficient
parallelization. We integrate the above optimizations into a complete hardware
pipeline, and analyze its load balance and resource utilization through
accurate performance modeling. We evaluate our design on a Xilinx Alveo U200
board hosted by a 40-core Xeon server. On three large graphs, we achieve an
order-of-magnitude training speedup with negligible accuracy loss, compared
with a state-of-the-art implementation on a multi-core platform.

Comment: Published in ACM/SIGDA FPGA '20
Domain-specific Architectures for Data-intensive Applications
Graphs' versatile ability to represent diverse relationships makes them effective for a wide range of applications. For instance, search engines use graph-based applications to provide high-quality search results. Medical centers use them to aid in patient diagnosis. Most recently, graphs have also been employed to support the management of viral pandemics. Looking forward, they show promise of unlocking several other opportunities, including combating the spread of fake content in social networks, detecting and preventing fraudulent online transactions in a timely fashion, and ensuring collision avoidance in autonomous vehicle navigation, to name a few. Unfortunately, all these applications require more computational power than conventional computing systems can provide. The key reason is that graph applications present large working sets that fail to fit in the small on-chip storage of existing computing systems, while at the same time accessing data in seemingly unpredictable patterns, and thus cannot benefit from traditional on-chip storage.
In this dissertation, we set out to address the performance limitations of existing computing systems so as to enable emerging graph applications like those described above. To achieve this, we identified three key strategies: 1) specializing the memory architecture, 2) processing data near its storage, and 3) coalescing messages in the network. Based on these strategies, this dissertation develops several solutions: OMEGA, which employs specialized on-chip storage units with co-located specialized compute engines to accelerate the computation; MessageFusion, which coalesces messages in the interconnect; and Centaur, an architecture that optimizes the processing of infrequently accessed data. Overall, these solutions provide 2x performance improvements, with negligible hardware overheads, across a wide range of applications.
Finally, we demonstrate the applicability of our strategies to other data-intensive domains by exploring an acceleration solution for MapReduce applications, which achieves a 4x performance speedup, also with negligible area and power overheads.

Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163186/1/abrahad_1.pd
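As a software analogy for the coalescing strategy behind MessageFusion, the minimal Python sketch below merges graph-analytics messages headed to the same destination vertex using an associative reduction operator before delivery; in the dissertation this combining happens inside the interconnect hardware, and the function name and the (destination, value) message format here are illustrative assumptions, not its actual interfaces.

```python
from operator import add


def coalesce_messages(messages, reduce_op=add):
    """Merge all in-flight messages bound for the same destination
    vertex into one, using the algorithm's reduction operator
    (e.g. add for rank accumulation, min for shortest paths).
    Illustrative sketch only; the real combining is done in hardware."""
    combined = {}
    for dst, value in messages:
        combined[dst] = reduce_op(combined[dst], value) if dst in combined else value
    return list(combined.items())


# Five messages collapse to two, shrinking on-chip network traffic.
msgs = [(7, 0.1), (7, 0.3), (2, 0.5), (7, 0.2), (2, 0.5)]
print(coalesce_messages(msgs))  # -> [(7, 0.6000...), (2, 1.0)]
```

The key requirement is that the reduction operator be associative and commutative, so messages can be combined in whatever order they happen to meet in the network without changing the result.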
Efficiently Accelerating Sparse Problems by Enabling Stream Accesses to Memory Using Hardware/Software Techniques
The objective of this research is to improve the performance of sparse problems, which have a wide range of applications but suffer from serious challenges on modern computers. In summary, the challenges are the underutilization of available memory bandwidth, caused by a lack of spatial locality, dependencies in the computation, or slow mechanisms for decompressing sparse data, and the underutilization of concurrent compute engines, caused by the distribution of non-zero values in the sparse data. Our key insight for addressing these challenges is that, depending on the type of problem, we can either use an intelligent reduction tree near memory to process data while gathering it from random memory locations, transform the computation mathematically to extract more parallelism, modify the distribution of non-zero elements, or change the representation of the sparse data. By applying such techniques, the execution adapts more effectively to the given hardware resources. To this end, this research introduces hardware/software techniques that enable stream accesses to memory, accelerating four main categories of sparse problems: inference in recommendation systems, iterative solvers for partial differential equations (PDEs), deep neural networks (DNNs), and graph algorithms.

Ph.D. dissertation
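As a heavily simplified software illustration of the stream-access idea, the Python sketch below rewrites a CSR sparse matrix-vector product as a streamed pass over the non-zeros (a gather) followed by per-row reductions; a near-memory reduction tree like the one the abstract describes would collapse the gathered values while they arrive, rather than afterwards. All names and the CSR layout are illustrative assumptions, not the thesis's design.

```python
import numpy as np


def gather_reduce_spmv(row_ptr, col_idx, vals, x):
    """CSR sparse matrix-vector multiply written as gather-then-reduce:
    col_idx and vals stream sequentially from memory, the random
    accesses are confined to the gather of x, and each row's products
    are collapsed by a reduction. (Illustrative sketch only.)"""
    gathered = x[col_idx] * vals           # one streaming pass + gather
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):                # per-row reduction
        y[r] = gathered[row_ptr[r]:row_ptr[r + 1]].sum()
    return y


# 2x3 matrix [[1, 0, 2], [0, 3, 0]] times x = [1, 1, 1] -> [3, 3]
row_ptr = np.array([0, 2, 3])
col_idx = np.array([0, 2, 1])
vals = np.array([1.0, 2.0, 3.0])
print(gather_reduce_spmv(row_ptr, col_idx, vals, np.ones(3)))
```

Separating the streamed gather from the reduction makes the memory traffic sequential and predictable, which is exactly the property that lets such workloads use the full available bandwidth instead of stalling on random accesses.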