Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
ReBNet: Residual Binarized Neural Network
This paper proposes ReBNet, an end-to-end framework for training
reconfigurable binary neural networks on software and developing efficient
accelerators for execution on FPGA. Binary neural networks offer an intriguing
opportunity for deploying large-scale deep learning models on
resource-constrained devices. Binarization reduces the memory footprint and
replaces the power-hungry matrix-multiplication with light-weight XnorPopcount
operations. However, binary networks suffer from degraded accuracy compared
to their fixed-point counterparts. We show that the state-of-the-art methods
for optimizing binary network accuracy significantly increase the
implementation cost and complexity. To compensate for the degraded accuracy
while adhering to the simplicity of binary networks, we devise the first
reconfigurable scheme that can adjust the classification accuracy based on the
application. Our proposition improves the classification accuracy by
representing features with multiple levels of residual binarization. Unlike
previous methods, our approach does not exacerbate the area cost of the
hardware accelerator. Instead, it provides a tradeoff between throughput and
accuracy while the area overhead of multi-level binarization is negligible.
Comment: To Appear In The 26th IEEE International Symposium on
Field-Programmable Custom Computing Machine
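The idea of representing features with multiple levels of residual binarization can be illustrated with a small sketch. It is not taken from the ReBNet paper: the function names, the fixed per-level scales in `gammas`, and the convention that zero binarizes to +1 are all assumptions made for illustration. Each level binarizes what the previous levels failed to capture, so adding levels tightens the approximation.

```python
import numpy as np

def residual_binarize(x, gammas):
    """Encode x as one +/-1 code per level (hypothetical helper).

    gammas holds an assumed fixed scale per level; a real framework
    would typically learn these during training.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    codes = []
    for gamma in gammas:
        # Binarize the current residual; zero maps to +1 by convention here.
        code = np.where(residual >= 0, 1.0, -1.0)
        codes.append(code)
        # Subtract what this level represents; the next level encodes the rest.
        residual = residual - gamma * code
    return codes

def reconstruct(codes, gammas):
    """Rebuild the approximation as a scaled sum of the binary codes."""
    return sum(g * c for g, c in zip(gammas, codes))

x = np.array([0.9, -0.3, 0.1, -1.2])
gammas = [0.6, 0.3, 0.15]  # illustrative values, not from the paper

errors = []
for levels in (1, 2, 3):
    codes = residual_binarize(x, gammas[:levels])
    approx = reconstruct(codes, gammas[:levels])
    errors.append(float(np.max(np.abs(x - approx))))

print(errors)  # worst-case error shrinks as levels are added
```

In this toy setting each extra level costs one more binary code per feature, which is one way to read the throughput versus accuracy tradeoff the abstract describes.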
FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software
A deep-learning inference accelerator is synthesized from a C-language
software program parallelized with Pthreads. The software implementation uses
the well-known producer/consumer model with parallel threads interconnected by
FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into
parallel FPGA hardware, translating software parallelism into spatial
parallelism. A complete system is generated where convolution, pooling and
padding are realized in the synthesized accelerator, with remaining tasks
executing on an embedded ARM processor. The accelerator incorporates reduced
precision, and a novel approach for zero-weight-skipping in convolution. On a
mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective
GOPS
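The producer/consumer structure described in this abstract, with parallel threads connected by FIFO queues, can be sketched as follows. This is an illustration only: it uses Python's `threading` and `queue` modules rather than C with Pthreads, it is not input for LegUp, and the stage functions (`pad`, `conv3`, `pool2`) are hypothetical one-dimensional stand-ins for the padding, convolution, and pooling stages.

```python
import queue
import threading

SENTINEL = object()  # unique end-of-stream marker

def producer(out_q, items):
    """Feed items into the first FIFO, then signal end of stream."""
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)

def stage(in_q, out_q, fn):
    """Consume from in_q, apply fn, and produce into out_q."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # forward the marker downstream
            return
        out_q.put(fn(item))

def run_pipeline(items, fns, depth=4):
    """Run one thread per stage, linked by bounded FIFO queues."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=producer, args=(queues[0], items))]
    for i, fn in enumerate(fns):
        threads.append(
            threading.Thread(target=stage, args=(queues[i], queues[i + 1], fn))
        )
    for t in threads:
        t.start()
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

# Hypothetical 1-D stand-ins for the accelerator stages.
def pad(row):
    return [0] + row + [0]

def conv3(row):
    # Box filter: sum over each window of three neighbours.
    return [row[i] + row[i + 1] + row[i + 2] for i in range(len(row) - 2)]

def pool2(row):
    # Max pooling over non-overlapping pairs.
    return [max(row[i], row[i + 1]) for i in range(0, len(row) - 1, 2)]

rows = [[1, 2, 3, 4], [4, 3, 2, 1]]
print(run_pipeline(rows, [pad, conv3, pool2]))
```

Because each stage is a single thread reading from a single FIFO, results leave the pipeline in the order the inputs entered it. The bounded queue depth plays the role of a finite hardware FIFO: a fast stage blocks when the queue ahead of it is full.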
XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference
Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to
conventional deep neural networks at a fraction of the cost in terms of memory
and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully
digital configurable hardware accelerator IP for BNNs, integrated within a
microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid
SRAM / standard cell memory. The XNE is able to fully compute convolutional and
dense layers in autonomy or in cooperation with the core in the MCU to realize
more complex behaviors. We show post-synthesis results in 65nm and 22nm
technology for the XNE IP and post-layout results in 22nm for the full MCU
indicating that this system can drop the energy cost per binary operation to
21.6fJ per operation at 0.4V, and at the same time is flexible and performant
enough to execute state-of-the-art BNN topologies such as ResNet-34 in less
than 2.2mJ per frame at 8.9 fps.
Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation
at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design
of Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special issu
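The binary operation that engines of this kind accelerate is commonly an XNOR followed by a population count. The sketch below shows the arithmetic identity only, under the assumption that +1 is encoded as bit 1 and -1 as bit 0. It does not describe the XNE datapath, and the helper names are hypothetical.

```python
def pack_bits(values):
    """Pack a +/-1 vector into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors of length n."""
    mask = (1 << n) - 1
    # XNOR marks the positions where the two vectors agree.
    agree = ~(a_bits ^ b_bits) & mask
    matches = bin(agree).count("1")  # population count
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))
fast = xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a))
print(reference, fast)  # the two values agree
```

The identity holds because a product of two ±1 values is +1 exactly when the values match, so a multiply-accumulate reduces to counting matching bits.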