13 research outputs found
Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA
Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems
Computing SpMV on FPGAs
There are hundreds of papers on accelerating sparse matrix vector multiplication (SpMV), however, only a handful target FPGAs. Some claim that FPGAs inherently perform inferiorly to CPUs and GPUs. FPGAs do perform inferiorly for some applications like matrix-matrix multiplication and matrix-vector multiplication. CPUs and GPUs have too much memory bandwidth and too much floating point computation power for FPGAs to compete. However, the low computations to memory operations ratio and irregular memory access of SpMV trips up both CPUs and GPUs. We see this as a leveling of the playing field for FPGAs.
Our implementation focuses on three pillars: matrix traversal, multiply-accumulator design, and matrix compression. First, most SpMV implementations traverse the matrix in row-major order, but we mix column and row traversal. Second, To accommodate the new traversal the multiply accumulator stores many intermediate y values. Third, we compress the matrix to increase the transfer rate of the matrix from RAM to the FPGA. Together these pillars enable our SpMV implementation to perform competitively with CPUs and GPUs
EIE: Efficient Inference Engine on Compressed Deep Neural Network
State-of-the-art deep neural networks (DNNs) have hundreds of millions of
connections and are both computationally and memory intensive, making them
difficult to deploy on embedded systems with limited hardware resources and
power budgets. While custom hardware helps the computation, fetching weights
from DRAM is two orders of magnitude more expensive than ALU operations, and
dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs
(AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by
pruning the redundant connections and having multiple connections share the
same weight. We propose an energy efficient inference engine (EIE) that
performs inference on this compressed network model and accelerates the
resulting sparse matrix-vector multiplication with weight sharing. Going from
DRAM to SRAM gives EIE 120x energy saving; Exploiting sparsity saves 10x;
Weight sharing gives 8x; Skipping zero activations from ReLU saves another 3x.
Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to
CPU and GPU implementations of the same DNN without compression. EIE has a
processing power of 102GOPS/s working directly on a compressed network,
corresponding to 3TOPS/s on an uncompressed network, and processes FC layers of
AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600mW. It is
24,000x and 3,400x more energy efficient than a CPU and GPU respectively.
Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy
efficiency and area efficiency.Comment: External Links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly:
https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision:
http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at
Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University:
https://goo.gl/6lwuer. Published as a conference paper in ISCA 201
Towards Efficient Hardware Acceleration of Deep Neural Networks on FPGA
Deep neural network (DNN) has achieved remarkable success in many applications because of its powerful capability for data processing. Their performance in computer vision have matched and in some areas even surpassed human capabilities. Deep neural networks can capture complex nonlinear features; however this ability comes at the cost of high computational and memory requirements. State-of-art networks require billions of arithmetic operations and millions of parameters. The brute-force computing model of DNN often requires extremely large hardware resources, introducing severe concerns on its scalability running on traditional von Neumann architecture. The well-known memory wall, and latency brought by the long-range connectivity and communication of DNN severely constrain the computation efficiency of DNN. The acceleration techniques of DNN, either software or hardware, often suffer from poor hardware execution efficiency of the simplified model (software), or inevitable accuracy degradation and limited supportable algorithms (hardware), respectively. In order to preserve the inference accuracy and make the hardware implementation in a more efficient form, a close investigation to the hardware/software co-design methodologies for DNNs is needed.
The proposed work first presents an FPGA-based implementation framework for Recurrent Neural Network (RNN) acceleration. At architectural level, we improve the parallelism of RNN training scheme and reduce the computing resource requirement for computation efficiency enhancement. The hardware implementation primarily targets at reducing data communication load. Secondly, we propose a data locality-aware sparse matrix and vector multiplication (SpMV) kernel. At software level, we reorganize a large sparse matrix into many modest-sized blocks by adopting hypergraph-based partitioning and clustering. Available hardware constraints have been taken into consideration for the memory allocation and data access regularization. Thirdly, we present a holistic acceleration to sparse convolutional neural network (CNN). During network training, the data locality is regularized to ease the hardware mapping. The distributed architecture enables high computation parallelism and data reuse. The proposed research results in an hardware/software co-design methodology for fast and accurate DNN acceleration, through the innovations in algorithm optimization, hardware implementation, and the interactive design process across these
two domains
Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture
We introduce Stardust, a compiler that compiles sparse tensor algebra to
reconfigurable dataflow architectures (RDAs). Stardust introduces new
user-provided data representation and scheduling language constructs for
mapping to resource-constrained accelerated architectures. Stardust uses the
information provided by these constructs to determine on-chip memory placement
and to lower to the Capstan RDA through a parallel-patterns rewrite system that
targets the Spatial programming model. The Stardust compiler is implemented as
a new compilation path inside the TACO open-source system. Using cycle-accurate
simulation, we demonstrate that Stardust can generate more Capstan tensor
operations than its authors had implemented and that it results in 138
better performance than generated CPU kernels and 41 better performance
than generated GPU kernels.Comment: 15 pages, 13 figures, 6 tables