Search CORE

1,444 research outputs found

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Author: Bouganis Christos-Savvas
Kouris Alexandros
Venieris Stylianos I.
Publication venue
Publication date: 19/02/2018
Field of study

In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

arXiv.org e-Print Archive

Spiral - Imperial College Digital Repository

Efficient FPGA Acceleration of Convolutional Deep Neural Networks

Author: Rahman Atul
Publication venue: Graduate School of UNIST
Publication date: 01/08/2016
Field of study

Department of Computer EngineeringDeep Convolutional Neural Networks (CNNs) are a powerful model for visual recognition tasks, but due to their very high computational requirement, acceleration is highly desired. FPGA accelerators for CNNs are typically built around one large MAC (multiply-accumulate) array, which is repeatedly used to perform the computation of all convolution layers, which can be quite diverse and complex. Thus a key challenge is how to design a common architecture that can perform well for all convolutional layers. In this paper we present a highly optimized and cost-effective 3D neuron array architecture that is a natural FFt for convolutional layers, along with a parameter selection framework to optimize its parameters for any given CNN model. We show through theoretical as well as empirical analyses that structuring compute elements in a 3D rather than a 2D topology can lead to higher performance through an improved utilization of key FPGA resources. Our experimental results targeting a Virtex-7 FPGA demonstrate that our proposed technique can generate CNN accelerators that can outperform the state-of-the-art solution, by 1.80x to maximum 4.05x for 32-bit ??floating-point, and 16-bit fixed-point MAC implementation respectively for different CNN models. Additionally, our proposed technique can generate designs that are far more scalable in terms of compute resources. We also report on the energy consumption of our accelerator in comparison with a GPGPU implementation.ope

ScholarWorks@UNIST

Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs

Author: Blott Michaela
Gambardella Giulio
Huang Qijing
Keutzer Kurt
Lavagno Luciano
Ma Liang
Vissers Kees
Wawrzynek John
Wu Bichen
Yang Yifan
Zhang Tianjun
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/05/2020
Field of study

Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, the key application characteristics such as frames-per-second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet

^{\dagger}

. Both the accelerator and ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only

1\times 1

convolutions while spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7\% top-5), but with 42

\times

fewer parameters and 48

\times

fewer OPs than VGG16. We further quantize DiracDeltaNet's weights to 4-bit and activations to 4-bits, with less than 1\% accuracy loss. These quantizations exploit well the nature of FPGA hardware. In short, DiracDeltaNet's small model size, low computational OP count, low precision and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator's final top-5 accuracy of 88.1\% on ImageNet, is higher than all the previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 66.3 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 11.6

\times

.Comment: Update to the latest result

arXiv.org e-Print Archive

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Multi-LSTM Acceleration and CNN Fault Tolerance

Author: Ribes Stefano
Publication venue
Publication date: 01/01/2021
Field of study

This thesis addresses the following two problems related to the field of Machine Learning: the acceleration of multiple Long Short Term Memory (LSTM) models on FPGAs and the fault tolerance of compressed Convolutional Neural Networks (CNN). LSTMs represent an effective solution to capture long-term dependencies in sequential data, like sentences in Natural Language Processing applications, video frames in Scene Labeling tasks or temporal series in Time Series Forecasting. In order to further boost their efficacy, especially in presence of long sequences, multiple LSTM models are utilized in a Hierarchical and Stacked fashion. However, because of their memory-bounded nature, efficient mapping of multiple LSTMs on a computing device becomes even more challenging. The first part of this thesis addresses the problem of mapping multiple LSTM models to a FPGA device by introducing a framework that modifies their memory requirements according to the target architecture. For the similar accuracy loss, the proposed framework maps multiple LSTMs with a performance improvement of 3x to 5x over state-of-the-art approaches. In the second part of this thesis, we investigate the fault tolerance of CNNs, another effective deep learning architecture. CNNs represent a dominating solution in image classification tasks, but suffer from a high performance cost, due to their computational structure. In fact, due to their large parameter space, fetching their data from main memory typically becomes a performance bottleneck. In order to tackle the problem, various techniques for their parameters compression have been developed, such as weight pruning, weight clustering and weight quantization. However, reducing the memory footprint of an application can lead to its data becoming more sensitive to faults. For this thesis work, we have conducted an analysis to verify the conditions for applying OddECC, a mechanism that supports variable strength and size ECCs for different memory regions. Our experiments reveal that compressed CNNs, which have their memory footprint reduced up to 86.3x by utilizing the aforementioned compression schemes, exhibit accuracy drops up to 13.56% in presence of random single bit faults

Chalmers Research

Cooperative high-performance computing with FPGAs - matrix multiply case-study

Author: Munafo Robert
Publication venue
Publication date: 03/07/2018
Field of study

In high-performance computing, there is great opportunity for systems that use FPGAs to handle communication while also performing computation on data in transit in an ``altruistic'' manner--that is, using resources for computation that might otherwise be used for communication, and in a way that improves overall system performance and efficiency. We provide a specific definition of \textbf{Computing in the Network} that captures this opportunity. We then outline some overall requirements and guidelines for cooperative computing that include this ability, and make suggestions for specific computing capabilities to be added to the networking hardware in a system. We then explore some algorithms running on a network so equipped for a few specific computing tasks: dense matrix multiplication, sparse matrix transposition and sparse matrix multiplication. In the first instance we give limits of problem size and estimates of performance that should be attainable with present-day FPGA hardware

Boston University Institutional Repository (OpenBU)

Assessing the Performance of OpenTitan as Cryptographic Accelerator in Secure Open-Hardware System-on-Chips

Author: Acquaviva Andrea
Barchi Francesco
Bartolini Andrea
Ciani Maicol
Musa Alberto
Parisi Emanuele
Rossi Davide
Publication venue
Publication date: 15/02/2024
Field of study

RISC-V open-source systems are emerging in deployment scenarios where safety and security are critical. OpenTitan is an open-source silicon root-of-trust designed to be deployed in a wide range of systems, from high-end to deeply embedded secure environments. Despite the availability of various cryptographic hardware accelerators that make OpenTitan suitable for offloading cryptographic workloads from the main processor, there has been no accurate and quantitative establishment of the benefits derived from using OpenTitan as a secure accelerator. This paper addresses this gap by thoroughly analysing strengths and inefficiencies when offloading cryptographic workloads to OpenTitan. The focus is on three key IPs - HMAC, AES, and OpenTitan Big Number accelerator (OTBN) - which can accelerate four security workloads: Secure Hash Functions, Message Authentication Codes, Symmetric cryptography, and Asymmetric cryptography. For every workload, we develop a bare-metal driver for the OpenTitan accelerator and analyze its efficiency when computation is offloaded from a RISC-V application core within a System-on-Chip designed for secure Cyber-Physical Systems applications. Finally, we assess it against a software implementation on the application core. The characterization was conducted on a cycle-accurate RTL simulator of the System-on-Chip (SoC). Our study demonstrates that OpenTitan significantly outperforms software implementations, with speedups ranging from 4.3x to 12.5x. However, there is potential for even greater gains as the current OpenTitan utilizes a fraction of the accelerator bandwidths, which ranges from 16% to 61%, depending on the memory being accessed and the accelerator used. Our results open the way to the optimization of OpenTitan-based secure platforms, providing design guidelines to unlock the full potential of its accelerators in secure applications.Comment: 8 pages, 2 figures, accepted at CF'24 conference, pre camera-ready versio

arXiv.org e-Print Archive