928 research outputs found

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Full text link
    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 {\mu}s latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 {\mu}s latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 201

    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.Comment: To appear at the DSN 2020 conferenc

    Design exploration and performance strategies towards power-efficient FPGA-based achitectures for sound source localization

    Get PDF
    Many applications rely on MEMS microphone arrays for locating sound sources prior to their execution. Those applications not only are executed under real-time constraints but also are often embedded on low-power devices. These environments become challenging when increasing the number of microphones or requiring dynamic responses. Field-Programmable Gate Arrays (FPGAs) are usually chosen due to their flexibility and computational power. This work intends to guide the design of reconfigurable acoustic beamforming architectures, which are not only able to accurately determine the sound Direction-Of-Arrival (DoA) but also capable to satisfy the most demanding applications in terms of power efficiency. Design considerations of the required operations performing the sound location are discussed and analysed in order to facilitate the elaboration of reconfigurable acoustic beamforming architectures. Performance strategies are proposed and evaluated based on the characteristics of the presented architecture. This power-efficient architecture is compared to a different architecture prioritizing performance in order to reveal the unavoidable design trade-offs

    Minimalistic SDHC-SPI hardware reader module for boot loader applications

    Get PDF
    This paper introduces a low-footprint full hardware boot loading solution for FPGA-based Programmable Systems on Chip. The proposed module allows loading the system code and data from a standard SD card without having to re-program the whole embedded system. The hardware boot loader is processor independent and removes the need of a software boot loader and the related memory resources. The hardware overhead introduced is manageable, even in low-range FPGA chips, and negligible in mid- and high-range devices. The implementation of the SD card reader module is explained in detail and an example of a multi-boot loader is offered as well. The multi-boot loader is implemented and tested with the Xilinx's Picoblaze microcontroller

    Hardware Implementations of a Deep Learning Approach to Optimal Configuration of Reconfigurable Intelligence Surfaces

    Get PDF
    Reconfigurable intelligent surfaces (RIS) offer the potential to customize the radio propagation environment for wireless networks, and will be a key element for 6G communications. However, due to the unique constraints in these systems, the optimization problems associated to RIS configuration are challenging to solve. This paper illustrates a new approach to the RIS configuration problem, based on the use of artificial intelligence (AI) and deep learning (DL) algorithms. Concretely, a custom convolutional neural network (CNN) intended for edge computing is presented, and implementations on different representative edge devices are compared, including the use of commercial AI-oriented devices and a field-programmable gate array (FPGA) platform. This FPGA option provides the best performance, with x20 performance increase over the closest FP32, GPU-accelerated option, and almost x3 performance advantage when compared with the INT8-quantized, TPU-accelerated implementation. More noticeably, this is achieved even when high-level synthesis (HLS) tools are used and no custom accelerators are developed. At the same time, the inherent reconfigurability of FPGAs opens a new field for their use as enabler hardware in RIS applications.This work is part of the project TED2021-129938B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR

    High Level Synthesis, a Use Case Comparison with Hardware Description Language

    Get PDF
    This paper compares Vivado High-Level Synthesis (HLS), a new mainstream technology offered by Xilinx Inc., against the typical Hardware Description Language (HDL) design approach. An example video filter application was implemented via both methods and compared for differences in performance and Non-Reoccurring Engineering (NRE). Lessons learned using HLS are also provided. The objective of this paper is to provide actual comparison data on the current state of mainstream HLS to enable informed decision making for designs considering HLS. The Xilinx Zync System on a Chip (SoC) offering is used as a platform for both the traditional HDL methods and HLS. This platform includes Field Programmable Gate Array (FPGA) fabric combined with a high speed application microprocessor. These single silicon SoC solutions appear to be a platform capable of effectively utilizing HLS. The example video application selected for implementation is a 9 by 9 kernel convolution filter performed on 24 bit 1080p video at 60 frames per second. The 2013 Xilinx Vivado tool suite was used for both HLS and HDL methods. HLS proved to be very easy to use to create a functional RTL design. With naïve implementations in both, HLS did not perform well in resource utilization. HLS also provided a design with a slower maximum clock frequency

    HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA

    Full text link
    Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host processor with programmable manycore accelerators (PMCAs) to combine general-purpose computing with domain-specific, efficient processing capabilities. While leading companies successfully advance their HESoC products, research lags behind due to the challenges of building a prototyping platform that unites an industry-standard host processor with an open research PMCA architecture. In this work we introduce HERO, an FPGA-based research platform that combines a PMCA composed of clusters of RISC-V cores, implemented as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host processor. The PMCA architecture mapped on the FPGA is silicon-proven, scalable, configurable, and fully modifiable. HERO includes a complete software stack that consists of a heterogeneous cross-compilation toolchain with support for OpenMP accelerator programming, a Linux driver, and runtime libraries for both host and PMCA. HERO is designed to facilitate rapid exploration on all software and hardware layers: run-time behavior can be accurately analyzed by tracing events, and modifications can be validated through fully automated hard ware and software builds and executed tests. We demonstrate the usefulness of HERO by means of case studies from our research
    • …
    corecore