Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated into the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at a comprehensive and in-depth
evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
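
As a toy illustration of this inefficiency (mine, not the paper's), consider a hypothetical engine that parallelises across 64 input channels: layers whose channel count is far from a multiple of 64 leave most of the lanes idle. The layer names and channel counts below are assumptions, chosen to resemble the early layers of an AlexNet-like network.

```python
# Hypothetical illustration: a fixed processor that parallelises over
# 64 input channels sits mostly idle on layers with few channels.
LANES = 64  # assumed input-channel parallelism of the processor

for name, channels in [("conv1", 3), ("conv2", 96), ("conv3", 256)]:
    # Ceiling-divide to count passes, then measure busy lanes per pass.
    passes = -(-channels // LANES)
    utilisation = channels / (LANES * passes)
    print(f"{name}: {channels:3d} channels -> {utilisation:.0%} lane utilisation")
```

With these numbers, conv1 keeps only about 5% of the lanes busy while conv3 reaches 100%, which is the mismatch the partitioning approach below sets out to remove.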
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach when
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
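
To make the partitioning idea concrete, here is a minimal sketch of the kind of design-space exploration such a methodology performs. Everything in it is a hypothetical stand-in rather than the paper's actual model: the per-layer MAC counts, the DSP budget, the per-processor overhead, the utilisation function, and the balanced DSP allocation are all assumptions chosen only to show why several specialised processors can outperform one monolithic design.

```python
from itertools import combinations

# Hypothetical per-layer MAC counts for a small five-layer CNN.
LAYER_MACS = [105e6, 224e6, 150e6, 112e6, 75e6]
TOTAL_DSPS = 2800                  # assumed DSP budget of the FPGA
MACS_PER_DSP_PER_SEC = 2 * 200e6   # assumed 2 MACs/cycle at 200 MHz
STAGE_OVERHEAD_DSPS = 120          # assumed fixed cost of each processor

def utilisation(stage):
    """Assumed efficiency model: a processor tailored to fewer layers
    keeps its arithmetic units busier than a one-size-fits-all design."""
    return max(0.5, 0.95 - 0.10 * (len(stage) - 1))

def throughput(partition):
    """Images/s of a pipeline of specialised processors.

    DSPs are allocated proportionally to work/utilisation so all stages
    run at the same rate; the pipeline then sustains
    usable_dsps * MACS_PER_DSP_PER_SEC / sum(work_i / util_i)."""
    usable = TOTAL_DSPS - STAGE_OVERHEAD_DSPS * len(partition)
    if usable <= 0:
        return 0.0
    weighted_work = sum(sum(s) / utilisation(s) for s in partition)
    return usable * MACS_PER_DSP_PER_SEC / weighted_work

def contiguous_partitions(layers, k):
    """All ways to cut `layers` into k contiguous, non-empty stages."""
    for cuts in combinations(range(1, len(layers)), k - 1):
        bounds = (0, *cuts, len(layers))
        yield [layers[a:b] for a, b in zip(bounds, bounds[1:])]

candidates = [p for k in range(1, len(LAYER_MACS) + 1)
              for p in contiguous_partitions(LAYER_MACS, k)]
best = max(candidates, key=throughput)
single = throughput([LAYER_MACS])  # one monolithic processor

print(f"monolithic : {single:7.1f} images/s")
print(f"partitioned: {throughput(best):7.1f} images/s "
      f"with {len(best)} stages ({throughput(best) / single:.2f}x)")
```

Even with these made-up numbers, the brute-force search settles on a multi-stage design because the utilisation gain from specialisation outweighs the fixed per-processor overhead, mirroring the trade-off the paper's automated methodology navigates; a real toolflow must additionally model on-chip memory, bandwidth, and the discrete resource types of the target device.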