
    FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

    A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model, with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes the threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated in which convolution, pooling, and padding are realized in the synthesized accelerator, while the remaining tasks execute on an embedded ARM processor. The accelerator incorporates reduced-precision arithmetic and a novel zero-weight-skipping approach for convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS.
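    To make the producer/consumer structure concrete, here is a minimal sketch of the software pattern the abstract describes: two Pthreads connected by a bounded FIFO, the shape that LegUp maps onto parallel hardware stages linked by hardware FIFOs. This is an illustrative reconstruction, not the paper's code; all names and sizes are assumptions.

```c
/* Minimal producer/consumer sketch: two Pthreads linked by a bounded FIFO,
 * the software structure an HLS tool like LegUp turns into parallel
 * hardware stages with FIFO interconnect. Sizes and data are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define FIFO_DEPTH 16
#define N_ITEMS    64

typedef struct {
    int buf[FIFO_DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} fifo_t;

static fifo_t q = {
    .lock      = PTHREAD_MUTEX_INITIALIZER,
    .not_full  = PTHREAD_COND_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER
};

static void fifo_push(fifo_t *f, int v) {
    pthread_mutex_lock(&f->lock);
    while (f->count == FIFO_DEPTH)          /* block while full */
        pthread_cond_wait(&f->not_full, &f->lock);
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->lock);
}

static int fifo_pop(fifo_t *f) {
    pthread_mutex_lock(&f->lock);
    while (f->count == 0)                   /* block while empty */
        pthread_cond_wait(&f->not_empty, &f->lock);
    int v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->lock);
    return v;
}

/* Producer thread: stands in for a stage streaming convolution inputs. */
static void *producer(void *arg) {
    for (int i = 0; i < N_ITEMS; i++)
        fifo_push(&q, i);
    return NULL;
}

/* Consumer thread: stands in for a downstream compute stage. */
static void *consumer(void *arg) {
    long sum = 0;
    for (int i = 0; i < N_ITEMS; i++)
        sum += fifo_pop(&q);
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```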

    Hardware Acceleration of Video Analytics on FPGA using OpenCL

    With the exponential growth in video content over the last few years, video analysis is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most video analysis applications use deep learning algorithms such as convolutional neural networks (CNNs) because of their high accuracy in object detection; enhancing the performance of CNN models therefore becomes crucial for video analysis. CNN models are computationally expensive and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in energy- and thermally-constrained environments such as traffic management, GPUs are less preferred because of their high power consumption, limited energy efficiency, and difficulty fitting into small form factors. To enable real-time video analytics in emerging large-scale Internet of Things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion; thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architectural adaptiveness, high computational throughput for streaming processing, and high energy efficiency. This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics at the edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which exploits both spatial and temporal parallelism for energy-efficient CNN acceleration on FPGAs. The large fan-in and fan-out from the computational units to the memory interface are identified as the limiting factor causing scalability issues in existing designs, and solutions based on compiler automation are proposed to resolve the issue. The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) of a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of AlexNet, ResNet-50, RetinaNet, and light-weight RetinaNet has been measured with the proposed CNN kernel on an Intel Arria 10 GX1150 FPGA. The measurements show that the proposed CNN kernel, when mapped with 100% utilization of computation resources, achieves latencies of 11 ms, 84 ms, 1614.9 ms, and 990.34 ms for AlexNet, ResNet-50, RetinaNet, and light-weight RetinaNet respectively, when the input feature maps and weights are represented in 32-bit floating point.
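    To make the role of the three architecture parameters concrete, here is a behavioral C model of a parameterized 1-D dot-product array. This is not the thesis's OpenCL kernel; the exact semantics of pe_num, reuse_fac, and vec_fac below are assumptions inferred from the abstract (parallel PEs, input reuse across filters, and MAC lanes per PE, respectively).

```c
/* Behavioral C model (not the thesis's OpenCL kernel) of a 1-D
 * dot-product array parameterized like the abstract's knobs. Assumed
 * semantics: PE_NUM parallel processing elements, VEC_FAC MAC lanes
 * per PE, and each fetched input strip reused across REUSE_FAC filters. */
#include <stdio.h>

#define PE_NUM    4   /* parallel processing elements (pe_num)         */
#define VEC_FAC   8   /* MAC lanes per PE (vec_fac)                    */
#define REUSE_FAC 2   /* filters sharing one fetched input (reuse_fac) */
#define K         64  /* dot-product length, multiple of VEC_FAC       */

void systolic_1d(const float in[K],
                 float w[PE_NUM * REUSE_FAC][K],
                 float out[PE_NUM * REUSE_FAC]) {
    for (int f = 0; f < PE_NUM * REUSE_FAC; f++)
        out[f] = 0.0f;
    /* Stream the input in VEC_FAC-wide strips; in hardware each strip
     * is pumped through the PE chain one stage per clock cycle. */
    for (int k = 0; k < K; k += VEC_FAC) {
        for (int pe = 0; pe < PE_NUM; pe++) {        /* spatial parallelism */
            for (int r = 0; r < REUSE_FAC; r++) {    /* input reuse */
                int f = pe * REUSE_FAC + r;
                float acc = 0.0f;
                for (int v = 0; v < VEC_FAC; v++)    /* vector lanes */
                    acc += w[f][k + v] * in[k + v];
                out[f] += acc;
            }
        }
    }
}

int main(void) {
    static float in[K];
    static float w[PE_NUM * REUSE_FAC][K];
    float out[PE_NUM * REUSE_FAC];
    for (int i = 0; i < K; i++) in[i] = 1.0f;
    for (int f = 0; f < PE_NUM * REUSE_FAC; f++)
        for (int i = 0; i < K; i++) w[f][i] = 1.0f;
    systolic_1d(in, w, out);
    printf("out[0] = %.1f\n", out[0]); /* expect 64.0 */
    return 0;
}
```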

    SECDA-TFLite: a toolkit for efficient development of FPGA-based DNN accelerators for edge inference

    In this paper, we propose SECDA-TFLite, a new open-source toolkit for developing DNN hardware accelerators, integrated within the TFLite framework. The toolkit leverages the principles of SECDA, a hardware/software co-design methodology, to reduce the design time of optimized DNN inference accelerators on edge devices with FPGAs. With SECDA-TFLite, we reduce the initial setup costs associated with integrating a new accelerator design into a target DNN framework, allowing developers to focus on the design itself. SECDA-TFLite also includes modules for cost-effective SystemC simulation, profiling, and AXI-based data communication. As a case study, we use SECDA-TFLite to develop and evaluate three accelerator designs across seven common CNN models and two BERT-based models against an ARM A9 CPU-only baseline, achieving average per-model speedups of up to 3.4× for the CNN models and up to 2.5× for the BERT-based models. Our code is available at https://github.com/gicLAB/SECDA-TFLite.
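    The offload pattern such a co-designed accelerator follows can be sketched as below. This is a hypothetical stub for illustration only: none of these names are SECDA-TFLite's actual API, and the "device" side is a software stand-in so the sketch runs standalone; in the real flow the tensors would travel over AXI to the FPGA, or to a SystemC model during simulation.

```c
/* Hypothetical sketch of the hardware/software split a co-designed
 * accelerator uses; illustrative names only, not SECDA-TFLite's API.
 * A delegated layer marshals its tensors to the accelerator and reads
 * the result back; the device here is a software model for portability. */
#include <stdint.h>
#include <stdio.h>

/* Software stand-in for the FPGA side. In a real system this logic sits
 * behind an AXI driver (or a SystemC model), not a local function call. */
static void device_gemm(const int8_t *a, const int8_t *b, int32_t *c,
                        int m, int n, int k) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int x = 0; x < k; x++)
                acc += (int32_t)a[i * k + x] * b[x * n + j];
            c[i * n + j] = acc;
        }
}

/* Host-side stub a delegated op might call instead of the CPU path.
 * Real flow: stream a and b over AXI, start the accelerator, read c. */
static void gemm_offload(const int8_t *a, const int8_t *b, int32_t *c,
                         int m, int n, int k) {
    device_gemm(a, b, c, m, n, k);
}

int main(void) {
    int8_t a[2 * 3] = {1, 2, 3, 4, 5, 6};
    int8_t b[3 * 2] = {1, 0, 0, 1, 1, 1};
    int32_t c[2 * 2];
    gemm_offload(a, b, c, 2, 2, 3);
    printf("%d %d / %d %d\n", c[0], c[1], c[2], c[3]); /* 4 5 / 10 11 */
    return 0;
}
```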

    Customizable vector acceleration in extreme-edge computing: A RISC-V software/hardware architecture study on VGG-16 implementation

    Computing in the cloud-edge continuum, as opposed to cloud computing, relies on high-performance processing at the extreme edge of the Internet of Things (IoT) hierarchy. Hardware acceleration is a mandatory solution to achieve the performance requirements, yet it can be tightly tied to particular computation kernels, even within the same application. Vector-oriented hardware acceleration has gained renewed interest to support artificial intelligence (AI) applications such as convolutional networks and classification algorithms. We present a comprehensive investigation of the performance and power efficiency achievable with configurable vector acceleration subsystems, obtaining evidence of both the high potential of the proposed microarchitecture and the advantage of hardware customization performed in total transparency to the software program.
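    As a rough illustration of the loop structure such a vector subsystem executes, the following is a minimal strip-mined sketch (assumed, not taken from the paper): the hardware vector length is a design-time customization parameter, and the same software loop adapts to whatever width the configured unit provides. On a RISC-V "V" implementation the strip width would come from vsetvl at run time rather than a compile-time constant.

```c
/* Minimal strip-mined sketch (an assumption, not the paper's code) of
 * the loop structure a configurable vector subsystem executes. VLEN
 * stands in for the design-time hardware vector length. */
#include <stddef.h>
#include <stdio.h>

#define VLEN 8  /* illustrative hardware vector length */

float dot_stripmined(const float *a, const float *b, size_t n) {
    float acc[VLEN] = {0};
    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN)        /* vector body: whole strips */
        for (size_t l = 0; l < VLEN; l++)   /* one lane per element */
            acc[l] += a[i + l] * b[i + l];
    float sum = 0.0f;
    for (size_t l = 0; l < VLEN; l++)       /* reduce across lanes */
        sum += acc[l];
    for (; i < n; i++)                      /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}

int main(void) {
    float a[19], b[19];                     /* 2 full strips + tail of 3 */
    for (int i = 0; i < 19; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    printf("%.1f\n", dot_stripmined(a, b, 19)); /* expect 38.0 */
    return 0;
}
```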

    SoWaF: Shuffling of Weights and Feature Maps: A Novel Hardware Intrinsic Attack (HIA) on Convolutional Neural Network (CNN)

    Security of the inference-phase deployment of convolutional neural networks (CNNs) on resource-constrained embedded systems (e.g., low-end FPGAs) is a growing research area. Using secure practices, third-party FPGA designers can be provided with no knowledge of the initial and final classification layers. In this work, we demonstrate that a hardware intrinsic attack (HIA) on such a “secure” design is still possible. The proposed HIA is inserted inside the mathematical operations of individual CNN layers and propagates erroneous results through all subsequent layers, leading to misclassification. The attack is non-periodic and completely random, hence difficult to detect. Five different attack scenarios, targeting individual CNN layers, are designed and evaluated in terms of resource overhead and triggering rate relative to the original implementation. Our results for two CNN architectures show that in all attack scenarios the additional latency is negligible (<0.61%) and the increase in DSPs, LUTs, and FFs is less than 2.36%. Three attack scenarios require no additional BRAM resources, while in the other two BRAM usage increases, compensated by a corresponding decrease in FFs and LUTs. To the authors’ best knowledge, this is the first work to address a hardware intrinsic CNN attack in which the attacker does not have knowledge of the full CNN.
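    A conceptual software model of the trigger mechanism is sketched below. The actual attack is hardware inserted into a layer's arithmetic pipeline, not C code; the trigger pattern and window size here are illustrative assumptions. The idea shown: a rare, data-dependent condition pairs inputs with shuffled weights, so the corruption fires non-periodically and propagates through all later layers.

```c
/* Conceptual software model of the trigger idea; the real attack is
 * hardware inserted into the layer's arithmetic, and this pattern and
 * window size are illustrative assumptions, not the paper's design. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Honest multiply-accumulate for one convolution output. */
static float mac(const float *win, const float *w, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += win[i] * w[i];
    return acc;
}

/* Trojaned MAC: when a rare bit pattern appears in the input window,
 * each input is paired with a shuffled weight. Because the trigger is
 * data-dependent it fires non-periodically, evading test vectors, and
 * the erroneous sums propagate through every subsequent layer. */
static float mac_trojan(const float *win, const float *w, int n) {
    uint32_t bits;
    memcpy(&bits, &win[0], sizeof bits);        /* inspect raw input bits */
    int triggered = (bits & 0xFFFu) == 0xABCu;  /* rare, illustrative pattern */
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += win[i] * w[triggered ? (n - 1 - i) : i];
    return acc;
}

int main(void) {
    float win[9], w[9];
    for (int i = 0; i < 9; i++) { win[i] = 0.5f; w[i] = (float)i; }
    /* Trigger stays dormant for this input, so the outputs match. */
    printf("clean=%.1f trojan=%.1f\n", mac(win, w, 9), mac_trojan(win, w, 9));
    return 0;
}
```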

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated into the existing deep learning ecosystem to provide a tunable balance between performance, power consumption, and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, including the supported applications, architectural choices, design space exploration methods, and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete, and in-depth evaluation of CNN-to-FPGA toolflows. Accepted for publication in ACM Computing Surveys (CSUR).