Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Efficient FPGA Acceleration of Convolutional Deep Neural Networks
Department of Computer Engineering
Deep Convolutional Neural Networks (CNNs) are a powerful model for visual recognition tasks, but due to their very high computational requirements, acceleration is highly desirable. FPGA accelerators for CNNs are typically built around one large MAC (multiply-accumulate) array, which is used repeatedly to compute all convolution layers, and these layers can be quite diverse and complex. A key challenge is therefore how to design a common architecture that performs well for all convolutional layers. In this paper we present a highly optimized and cost-effective 3D neuron array architecture that is a natural fit for convolutional layers, along with a parameter selection framework to optimize its parameters for any given CNN model. We show through theoretical as well as empirical analyses that structuring compute elements in a 3D rather than a 2D topology can lead to higher performance through improved utilization of key FPGA resources. Our experimental results targeting a Virtex-7 FPGA demonstrate that our proposed technique can generate CNN accelerators that outperform the state-of-the-art solution by 1.80x to a maximum of 4.05x for 32-bit floating-point and 16-bit fixed-point MAC implementations, respectively, for different CNN models. Additionally, our proposed technique can generate designs that are far more scalable in terms of compute resources. We also report on the energy consumption of our accelerator in comparison with a GPGPU implementation.
Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
Using FPGAs to accelerate ConvNets has attracted significant attention in
recent years. However, FPGA accelerator design has not leveraged the latest
progress of ConvNets. As a result, the key application characteristics such as
frames-per-second (FPS) are ignored in favor of simply counting GOPs, and
results on accuracy, which is critical to application success, are often not
even reported. In this work, we adopt an algorithm-hardware co-design approach
to develop a ConvNet accelerator called Synetgy and a novel ConvNet model
called DiracDeltaNet. Both the accelerator and ConvNet are tailored
to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with
only 1×1 convolutions, while spatial convolutions are replaced by more
efficient shift operations. DiracDeltaNet achieves competitive accuracy on
ImageNet (88.7% top-5), but with 42× fewer parameters and 48×
fewer OPs than VGG16. We further quantize DiracDeltaNet's weights to 4 bits and
activations to 4 bits, with less than 1% accuracy loss. These quantizations
exploit well the nature of FPGA hardware. In short, DiracDeltaNet's small model
size, low computational OP count, low precision and simplified operators allow
us to co-design a highly customized computing unit for an FPGA. We implement
the computing units for DiracDeltaNet on an Ultra96 SoC system through
high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on
ImageNet is higher than that of all previously reported embedded FPGA
accelerators. In addition, the accelerator reaches an inference speed of 66.3
FPS on the ImageNet classification task, surpassing prior works with similar
accuracy by at least 11.6×.
Comment: Update to the latest result
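The abstract above describes replacing spatial convolutions with shift operations followed by 1×1 convolutions. As a rough illustration of that idea only, the sketch below shifts each input channel by a fixed offset (zero-filling the vacated cells) and then mixes channels point-wise. The function names, the per-channel offsets and the nested-list layout are assumptions made for this example; they are not taken from the Synetgy implementation.

```python
def shift_channel(channel, dy, dx):
    """Shift a 2D feature map by (dy, dx); vacated cells are filled with 0."""
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = channel[sy][sx]
    return out


def pointwise_conv(channels, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(channels[0]), len(channels[0][0])
    n_in = len(channels)
    return [
        [[sum(wo[c] * channels[c][y][x] for c in range(n_in))
          for x in range(w)]
         for y in range(h)]
        for wo in weights
    ]


def shift_then_pointwise(channels, offsets, weights):
    """Hypothetical shift + 1x1 block: the shift moves data and needs no
    multiplications; all arithmetic happens in the 1x1 convolution."""
    shifted = [shift_channel(ch, dy, dx)
               for ch, (dy, dx) in zip(channels, offsets)]
    return pointwise_conv(shifted, weights)
```

Because the shift step only moves data, the multiply-accumulate work is confined to the 1×1 stage, which is one plausible reading of why the abstract calls shifts "more efficient" than spatial convolutions.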
Multi-LSTM Acceleration and CNN Fault Tolerance
This thesis addresses two problems in the field of Machine Learning: the acceleration of multiple Long Short-Term Memory (LSTM) models on FPGAs and the fault tolerance of compressed Convolutional Neural Networks (CNNs). LSTMs are an effective solution for capturing long-term dependencies in sequential data, such as sentences in Natural Language Processing applications, video frames in Scene Labeling tasks or temporal series in Time Series Forecasting. To further boost their efficacy, especially in the presence of long sequences, multiple LSTM models are used in a Hierarchical and Stacked fashion. However, because of their memory-bound nature, efficiently mapping multiple LSTMs onto a computing device becomes even more challenging. The first part of this thesis addresses the problem of mapping multiple LSTM models to an FPGA device by introducing a framework that modifies their memory requirements according to the target architecture. For a similar accuracy loss, the proposed framework maps multiple LSTMs with a performance improvement of 3x to 5x over state-of-the-art approaches. In the second part of this thesis, we investigate the fault tolerance of CNNs, another effective deep learning architecture. CNNs are a dominant solution in image classification tasks, but suffer from a high performance cost due to their computational structure. In fact, due to their large parameter space, fetching their data from main memory typically becomes a performance bottleneck. To tackle this problem, various techniques for compressing their parameters have been developed, such as weight pruning, weight clustering and weight quantization. However, reducing the memory footprint of an application can make its data more sensitive to faults. For this thesis work, we have conducted an analysis to verify the conditions for applying OddECC, a mechanism that supports variable strength and size ECCs for different memory regions.
Our experiments reveal that compressed CNNs, whose memory footprint is reduced by up to 86.3x using the aforementioned compression schemes, exhibit accuracy drops of up to 13.56% in the presence of random single-bit faults.
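To make the notion of a single-bit fault concrete, the sketch below flips one bit of a quantized weight stored as a signed two's-complement integer and reports how far the value moves. This is an illustration under assumed conditions (8-bit signed weights, hypothetical helper names `flip_bit` and `fault_impact`); it is not the fault-injection setup used in the thesis and does not model OddECC.

```python
def flip_bit(value, bit, width=8):
    """Flip one bit of a signed two's-complement integer of `width` bits."""
    mask = (1 << width) - 1
    raw = (value & mask) ^ (1 << bit)  # work on the unsigned bit pattern
    if raw >= 1 << (width - 1):        # reinterpret the pattern as signed
        raw -= 1 << width
    return raw


def fault_impact(value, width=8):
    """Absolute change in the stored value for a flip at each bit position."""
    return [abs(flip_bit(value, b, width) - value) for b in range(width)]
```

A flip at bit position b changes the stored value by 2^b, so a fault in the sign bit of an 8-bit weight moves it by 128 while a fault in the lowest bit moves it by 1. With fewer bits per weight after quantization, each bit accounts for a larger share of the representable range, which is consistent with the abstract's observation that compressed data can be more sensitive to faults.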
Cooperative high-performance computing with FPGAs - matrix multiply case-study
In high-performance computing, there is great opportunity for systems
that use FPGAs to handle communication while also performing
computation on data in transit in an "altruistic" manner, that is,
using resources for computation that might otherwise be used for
communication, and in a way that improves overall system performance
and efficiency. We provide a specific definition of Computing
in the Network that captures this opportunity. We then outline some
overall requirements and guidelines for cooperative computing that
include this ability, and make suggestions for specific computing
capabilities to be added to the networking hardware in a system. We
then explore some algorithms running on a network so equipped
for a few specific computing tasks: dense matrix multiplication,
sparse matrix transposition and sparse matrix multiplication. In the
first instance we give limits of problem size and estimates of
performance that should be attainable with present-day FPGA hardware.
Assessing the Performance of OpenTitan as Cryptographic Accelerator in Secure Open-Hardware System-on-Chips
RISC-V open-source systems are emerging in deployment scenarios where safety
and security are critical. OpenTitan is an open-source silicon root-of-trust
designed to be deployed in a wide range of systems, from high-end to deeply
embedded secure environments. Despite the availability of various cryptographic
hardware accelerators that make OpenTitan suitable for offloading cryptographic
workloads from the main processor, there has been no accurate and quantitative
establishment of the benefits derived from using OpenTitan as a secure
accelerator. This paper addresses this gap by thoroughly analysing strengths
and inefficiencies when offloading cryptographic workloads to OpenTitan. The
focus is on three key IPs - HMAC, AES, and OpenTitan Big Number accelerator
(OTBN) - which can accelerate four security workloads: Secure Hash Functions,
Message Authentication Codes, Symmetric cryptography, and Asymmetric
cryptography. For every workload, we develop a bare-metal driver for the
OpenTitan accelerator and analyze its efficiency when computation is offloaded
from a RISC-V application core within a System-on-Chip designed for secure
Cyber-Physical Systems applications. Finally, we assess it against a software
implementation on the application core. The characterization was conducted on a
cycle-accurate RTL simulator of the System-on-Chip (SoC). Our study
demonstrates that OpenTitan significantly outperforms software implementations,
with speedups ranging from 4.3x to 12.5x. However, there is potential for even
greater gains, as the current OpenTitan utilizes only a fraction of the accelerator
bandwidth, ranging from 16% to 61% depending on the memory being
accessed and the accelerator used. Our results open the way to the optimization
of OpenTitan-based secure platforms, providing design guidelines to unlock the
full potential of its accelerators in secure applications.
Comment: 8 pages, 2 figures, accepted at CF'24 conference, pre camera-ready
version