
    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated into the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Efficient FPGA Acceleration of Convolutional Deep Neural Networks

    Deep Convolutional Neural Networks (CNNs) are a powerful model for visual recognition tasks, but due to their very high computational requirements, acceleration is highly desired. FPGA accelerators for CNNs are typically built around one large MAC (multiply-accumulate) array, which is used repeatedly to perform the computation of all convolution layers, even though these layers can be quite diverse and complex. Thus a key challenge is how to design a common architecture that can perform well for all convolutional layers. In this paper we present a highly optimized and cost-effective 3D neuron array architecture that is a natural fit for convolutional layers, along with a parameter selection framework to optimize its parameters for any given CNN model. We show through theoretical as well as empirical analyses that structuring compute elements in a 3D rather than a 2D topology can lead to higher performance through an improved utilization of key FPGA resources. Our experimental results targeting a Virtex-7 FPGA demonstrate that our proposed technique can generate CNN accelerators that outperform the state-of-the-art solution by 1.80x and by up to 4.05x for 32-bit floating-point and 16-bit fixed-point MAC implementations, respectively, across different CNN models. Additionally, our proposed technique can generate designs that are far more scalable in terms of compute resources. We also report on the energy consumption of our accelerator in comparison with a GPGPU implementation.
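
    As an illustration of the kind of parameter selection the abstract describes, the sketch below enumerates candidate 3D compute-array dimensions under a DSP budget and keeps the shape with the best average utilization across a model's convolution layers. The layer shapes, the budget and the scoring function are assumptions made for the example, not details taken from the paper.

```python
import math
from itertools import product

# (output height, output width, output channels) per conv layer -- AlexNet-like
# shapes used purely for illustration.
LAYERS = [(55, 55, 96), (27, 27, 256), (13, 13, 384), (13, 13, 384), (13, 13, 256)]
DSP_BUDGET = 512  # hypothetical number of MAC units available on the device

def layer_utilization(layer, array):
    """Fraction of MACs doing useful work when each layer dimension is tiled
    onto one array dimension (ceiling division wastes the remainder)."""
    h, w, m = layer
    th, tw, tm = array
    cycles = math.ceil(h / th) * math.ceil(w / tw) * math.ceil(m / tm)
    return (h * w * m) / (cycles * th * tw * tm)

# Exhaustively search array shapes (T_h, T_w, T_m) that fit the DSP budget and
# keep the one with the best average utilization over all layers.
candidates = (a for a in product(range(1, 56), range(1, 56), range(1, 385))
              if a[0] * a[1] * a[2] <= DSP_BUDGET)
best = max(candidates,
           key=lambda a: sum(layer_utilization(l, a) for l in LAYERS) / len(LAYERS))
print("best (T_h, T_w, T_m):", best)
```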

    Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs

    Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, key application characteristics such as frames-per-second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and the ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only 1×1 convolutions; spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7% top-5), but with 42× fewer parameters and 48× fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and activations to 4 bits, with less than 1% accuracy loss. These quantizations exploit the nature of FPGA hardware well. In short, DiracDeltaNet's small model size, low computational OP count, low precision and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on ImageNet is higher than all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 66.3 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 11.6×.
    Comment: Update to the latest result
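
    The following PyTorch sketch illustrates the kind of building block the abstract describes: a parameter-free shift operation standing in for a spatial convolution, followed by a 1×1 convolution. The channel grouping, shift directions and layer sizes are illustrative assumptions; the actual DiracDeltaNet block may differ.

```python
import torch
import torch.nn as nn

class ShiftConv1x1(nn.Module):
    """Shift each channel group by one pixel in a fixed direction (no
    parameters), then mix channels with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # "no shift" plus the 8 neighbour directions; if the channels do not
        # split into 9 groups, trailing directions simply go unused
        self.dirs = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0),
                     (1, 1), (1, -1), (-1, 1), (-1, -1)]
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        groups = torch.chunk(x, len(self.dirs), dim=1)
        shifted = [torch.roll(g, shifts=d, dims=(2, 3))
                   for g, d in zip(groups, self.dirs)]
        x = torch.cat(shifted, dim=1)        # parameter-free spatial mixing
        return self.act(self.bn(self.pw(x)))

# Example: a 32 -> 64 channel block on a 56x56 feature map
block = ShiftConv1x1(32, 64)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```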

    Multi-LSTM Acceleration and CNN Fault Tolerance

    This thesis addresses two problems related to the field of Machine Learning: the acceleration of multiple Long Short-Term Memory (LSTM) models on FPGAs, and the fault tolerance of compressed Convolutional Neural Networks (CNNs). LSTMs are an effective solution for capturing long-term dependencies in sequential data, such as sentences in Natural Language Processing applications, video frames in Scene Labeling tasks or temporal series in Time Series Forecasting. To further boost their efficacy, especially in the presence of long sequences, multiple LSTM models are used in a Hierarchical and Stacked fashion. However, because of their memory-bounded nature, efficiently mapping multiple LSTMs onto a computing device becomes even more challenging. The first part of this thesis addresses the problem of mapping multiple LSTM models to an FPGA device by introducing a framework that modifies their memory requirements according to the target architecture. For a similar accuracy loss, the proposed framework maps multiple LSTMs with a performance improvement of 3x to 5x over state-of-the-art approaches. In the second part of this thesis, we investigate the fault tolerance of CNNs, another effective deep learning architecture. CNNs are a dominant solution for image classification tasks, but suffer from a high performance cost due to their computational structure. In fact, because of their large parameter space, fetching their data from main memory typically becomes a performance bottleneck. To tackle this problem, various techniques for compressing their parameters have been developed, such as weight pruning, weight clustering and weight quantization. However, reducing the memory footprint of an application can make its data more sensitive to faults. In this thesis, we conduct an analysis to verify the conditions for applying OddECC, a mechanism that supports ECCs of variable strength and size for different memory regions. Our experiments reveal that compressed CNNs, whose memory footprint is reduced by up to 86.3x through the aforementioned compression schemes, exhibit accuracy drops of up to 13.56% in the presence of random single-bit faults.
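
    The sketch below gives a flavor of the fault-injection experiment described above: quantize a weight tensor, flip random single bits in its raw storage, and measure how much the stored values change. The tensor size, bit width and number of injected faults are illustrative assumptions, not the setup used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.05, size=4096).astype(np.float32)

# 8-bit linear quantization (a common compression scheme, chosen for illustration)
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)

def flip_random_bits(arr, n_faults):
    """Flip n_faults random single bits in the raw bytes of arr."""
    faulty = arr.copy()
    raw = faulty.view(np.uint8)           # byte-level view of the same buffer
    for _ in range(n_faults):
        byte = rng.integers(raw.size)
        bit = rng.integers(8)
        raw[byte] ^= np.uint8(1 << bit)
    return faulty

q_faulty = flip_random_bits(q, n_faults=16)

# Reconstruct the weights and measure the perturbation the faults introduced
err = np.abs(q_faulty.astype(np.float32) * scale - q.astype(np.float32) * scale)
print("corrupted weights:", int((err > 0).sum()), " max abs change:", float(err.max()))
```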

    Cooperative high-performance computing with FPGAs - matrix multiply case-study

    In high-performance computing, there is great opportunity for systems that use FPGAs to handle communication while also performing computation on data in transit in an "altruistic" manner, that is, using resources for computation that might otherwise be used for communication, and in a way that improves overall system performance and efficiency. We provide a specific definition of Computing in the Network that captures this opportunity. We then outline some overall requirements and guidelines for cooperative computing that include this ability, and make suggestions for specific computing capabilities to be added to the networking hardware in a system. We then explore some algorithms running on a network so equipped for a few specific computing tasks: dense matrix multiplication, sparse matrix transposition and sparse matrix multiplication. In the first instance we give limits of problem size and estimates of performance that should be attainable with present-day FPGA hardware.
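
    A small numpy sketch of the "computing in the network" idea for dense matrix multiplication: each compute node produces a partial product over its slice of the shared dimension, and the network, modeled here as a pairwise reduction tree, sums the partial results while forwarding them instead of at a single root. The node count and matrix sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M, NODES = 64, 256, 64, 8
A = rng.standard_normal((N, K))
B = rng.standard_normal((K, M))

# Each node owns a K/NODES slice of the inner dimension and computes a partial C
slices = np.array_split(np.arange(K), NODES)
partials = [A[:, s] @ B[s, :] for s in slices]

def network_reduce(messages):
    """Pairwise reduction tree: each 'switch' adds two incoming partials,
    standing in for an FPGA NIC accumulating data as it forwards it."""
    while len(messages) > 1:
        messages = [messages[i] + messages[i + 1] if i + 1 < len(messages)
                    else messages[i]
                    for i in range(0, len(messages), 2)]
    return messages[0]

C = network_reduce(partials)
print("max error vs. A @ B:", np.abs(C - A @ B).max())
```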

    Assessing the Performance of OpenTitan as Cryptographic Accelerator in Secure Open-Hardware System-on-Chips

    RISC-V open-source systems are emerging in deployment scenarios where safety and security are critical. OpenTitan is an open-source silicon root-of-trust designed to be deployed in a wide range of systems, from high-end to deeply embedded secure environments. Despite the availability of various cryptographic hardware accelerators that make OpenTitan suitable for offloading cryptographic workloads from the main processor, the benefits of using OpenTitan as a secure accelerator have not been established accurately and quantitatively. This paper addresses this gap by thoroughly analysing the strengths and inefficiencies of offloading cryptographic workloads to OpenTitan. The focus is on three key IPs - HMAC, AES, and the OpenTitan Big Number accelerator (OTBN) - which can accelerate four security workloads: secure hash functions, message authentication codes, symmetric cryptography, and asymmetric cryptography. For every workload, we develop a bare-metal driver for the OpenTitan accelerator and analyze its efficiency when computation is offloaded from a RISC-V application core within a System-on-Chip (SoC) designed for secure Cyber-Physical Systems applications. Finally, we assess each workload against a software implementation on the application core. The characterization was conducted on a cycle-accurate RTL simulator of the SoC. Our study demonstrates that OpenTitan significantly outperforms software implementations, with speedups ranging from 4.3x to 12.5x. However, there is potential for even greater gains, as the current OpenTitan utilizes only a fraction of the accelerators' bandwidth, ranging from 16% to 61% depending on the memory being accessed and the accelerator used. Our results open the way to the optimization of OpenTitan-based secure platforms, providing design guidelines to unlock the full potential of its accelerators in secure applications.
    Comment: 8 pages, 2 figures, accepted at the CF'24 conference, pre-camera-ready version
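
    A back-of-the-envelope sketch of the offload trade-off analysed above: offloading pays off once the data block is large enough that the accelerator's per-byte advantage outweighs the fixed driver and readback overhead. All cycle counts below are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical cost model for offloading a hash/MAC workload to an accelerator.
CPU_CYCLES_PER_BYTE = 20.0      # assumed software cost on the RISC-V core
ACC_CYCLES_PER_BYTE = 1.6       # assumed accelerator cost per byte
OFFLOAD_OVERHEAD_CYCLES = 4000  # assumed driver setup + result readback cost

def speedup(n_bytes: int) -> float:
    sw = CPU_CYCLES_PER_BYTE * n_bytes
    hw = OFFLOAD_OVERHEAD_CYCLES + ACC_CYCLES_PER_BYTE * n_bytes
    return sw / hw

for n in (64, 256, 1024, 4096, 16384):
    print(f"{n:6d} B  speedup {speedup(n):5.2f}x")
```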