407 research outputs found
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps
Convolutional neural networks (CNNs) have become the dominant neural network
architecture for solving many state-of-the-art (SOA) visual processing tasks.
Even though Graphical Processing Units (GPUs) are most often used in training
and deploying CNNs, their power efficiency is less than 10 GOp/s/W for
single-frame runtime inference. We propose a flexible and efficient CNN
accelerator architecture called NullHop that implements SOA CNNs useful for
low-power and low-latency application scenarios. NullHop exploits the sparsity
of neuron activations in CNNs to accelerate the computation and reduce memory
requirements. The flexible architecture allows high utilization of available
computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can
process up to 128 input and 128 output feature maps per layer in a single pass.
We implemented the proposed architecture on a Xilinx Zynq FPGA platform and
present results showing how our implementation reduces external memory
transfers and compute time in five different CNNs ranging from small ones up to
the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using
Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that
the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop
achieves an efficiency of 368%, maintains over 98% utilization of the MAC
units, and achieves a power efficiency of over 3TOp/s/W in a core area of
6.3mm. As further proof of NullHop's usability, we interfaced its FPGA
implementation with a neuromorphic event camera for real time interactive
demonstrations
SPEC2: SPECtral SParsE CNN Accelerator on FPGAs
To accelerate inference of Convolutional Neural Networks (CNNs), various
techniques have been proposed to reduce computation redundancy. Converting
convolutional layers into frequency domain significantly reduces the
computation complexity of the sliding window operations in space domain. On the
other hand, weight pruning techniques address the redundancy in model
parameters by converting dense convolutional kernels into sparse ones. To
obtain high-throughput FPGA implementation, we propose SPEC2 -- the first work
to prune and accelerate spectral CNNs. First, we propose a systematic pruning
algorithm based on Alternative Direction Method of Multipliers (ADMM). The
offline pruning iteratively sets the majority of spectral weights to zero,
without using any handcrafted heuristics. Then, we design an optimized pipeline
architecture on FPGA that has efficient random access into the sparse kernels
and exploits various dimensions of parallelism in convolutional layers.
Overall, SPEC2 achieves high inference throughput with extremely low
computation complexity and negligible accuracy degradation. We demonstrate
SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex
platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy
loss for LeNet, and <1% accuracy loss for VGG16. The resulting accelerators
achieve up to 24x higher throughput, compared with the state-of-the-art FPGA
implementations for VGG16.Comment: This is a 10-page conference paper in 26TH IEEE International
Conference On High Performance Computing, Data, and Analytics (HiPC
- …