Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
Design and Optimization of Residual Neural Network Accelerators for Low-Power FPGAs Using High-Level Synthesis
Residual neural networks are widely used in computer vision tasks. They
enable the construction of deeper and more accurate models by mitigating the
vanishing gradient problem. Their main innovation is the residual block, which
allows the output of one layer to bypass one or more intermediate layers and be
added to the output of a later layer. Their complex structure and the buffering
required by the residual block make them difficult to implement on
resource-constrained platforms. We present a novel design flow for implementing
deep learning models on field-programmable gate arrays (FPGAs), optimized for
ResNets, using a strategy that reduces their buffering overhead to obtain a
resource-efficient implementation of the residual layer. Our high-level
synthesis (HLS)-based flow encompasses a thorough set of design principles and
optimization strategies, exploiting standard techniques such as temporal reuse
and loop merging in novel ways to efficiently map ResNet models, and
potentially other skip-connection-based NN architectures, onto FPGAs. The models
are quantized to 8-bit integers for both weights and activations, 16-bit for
biases, and 32-bit for accumulations. The experimental results are obtained on
the CIFAR-10 dataset using ResNet8 and ResNet20 implemented with Xilinx FPGAs
using HLS on the Ultra96-V2 and Kria KV260 boards. Compared to the
state-of-the-art on the Kria KV260 board, our ResNet20 implementation achieves
a 2.88X speedup with 0.5% higher accuracy, reaching 91.3%, while ResNet8
accuracy improves by 2.8% to 88.7%. The throughputs of ResNet8 and ResNet20 are
12971 FPS and 3254 FPS on the Ultra96 board, and 30153 FPS and 7601 FPS on the
Kria KV260, respectively. They Pareto-dominate state-of-the-art solutions in
accuracy, throughput, and energy.
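The quantization scheme the abstract describes (8-bit weights and activations with 32-bit accumulation, combined with a skip connection) can be sketched as follows. This is a minimal illustrative model in NumPy, not the paper's implementation: the function name, the use of a matmul in place of a convolution, and the single shared requantization scale are all simplifying assumptions.

```python
import numpy as np

def quantized_residual_block(x_q, w1_q, w2_q, scale):
    """Illustrative sketch (not the paper's code) of a residual block with
    int8 weights/activations and int32 accumulation.
    For simplicity, convolutions are modeled as matmuls and a single
    requantization scale is assumed to apply everywhere."""
    # First layer: widen int8 operands to int32 before multiply-accumulate
    # so the accumulator cannot overflow.
    acc1 = x_q.astype(np.int32) @ w1_q.astype(np.int32)
    # Requantize the int32 accumulator back to int8, then apply ReLU.
    h_q = np.clip(np.round(acc1 * scale), -128, 127).astype(np.int8)
    h_q = np.maximum(h_q, 0)
    # Second layer, again accumulating in int32.
    acc2 = h_q.astype(np.int32) @ w2_q.astype(np.int32)
    # Skip connection: the block input bypasses the two layers and is
    # added to the accumulator before the final requantization
    # (assumes the input and accumulator scales match, for simplicity).
    acc2 += x_q.astype(np.int32)
    y_q = np.clip(np.round(acc2 * scale), -128, 127).astype(np.int8)
    return np.maximum(y_q, 0)
```

The point of the sketch is the dataflow, not the arithmetic details: the block input must be buffered until the second layer's output is ready, which is exactly the buffering overhead the paper's design flow targets.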