    Multi-Softcore Architecture on FPGA

    To meet the high-performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standard intellectual property (IP) cores while maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes an FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of building the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and some parallel applications are used to test the speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).
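
    The speedup and efficiency figures mentioned in this abstract are conventionally defined as S = T_serial / T_parallel and E = S / p for p cores. The sketch below uses those textbook definitions only; the function names and the timings in the usage note are hypothetical and are not taken from the paper.

```python
def speedup(t_serial, t_parallel):
    """Classical speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_cores):
    """Parallel efficiency: speedup normalised by the number of cores used."""
    return speedup(t_serial, t_parallel) / n_cores
```

    For example, a hypothetical kernel that takes 8.0 s on one core and 2.5 s on four cores would have a speedup of 3.2 and an efficiency of 0.8.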

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 201
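
    When weights and activations are restricted to +1/-1, a dot product can be computed with an XNOR followed by a population count, which is one reason binarized networks map well to FPGA logic. The sketch below illustrates that general identity in software; the bit encoding (1 for +1, 0 for -1) and the helper names are assumptions for illustration, not the FINN datapath itself.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as packed bits.

    XNOR marks the positions where the two vectors agree. Each agreement
    contributes +1 and each disagreement -1, so dot = 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR, limited to the n valid bits
    matches = bin(agree).count("1")        # population count
    return 2 * matches - n
```

    The result equals the ordinary sum of element-wise products of the two +1/-1 vectors, but uses only bitwise operations and a bit count.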

    Designing parameterizable hardware IPs in a model-based design environment for high-level synthesis

    Model-based hardware design allows one to map a single model to multiple hardware and/or software architectures, essentially eliminating one of the major limitations of manual coding in C or RTL. Model-based design for hardware implementation has traditionally offered a limited set of microarchitectures, which are typically suitable only for some application scenarios. In this article we illustrate how digital signal processing (DSP) algorithms can be modeled as flexible intellectual property blocks to be used within the popular Simulink model-based design environment. These blocks are written in C and are designed for both functional simulation and hardware implementation, including architectural design space exploration and hardware implementation through high-level synthesis. A key advantage of our modeling approach is that the very same bit-accurate model is used for simulation and high-level synthesis. To prove the feasibility of our proposed approach, we modeled a fast Fourier transform (FFT) algorithm and synthesized it for different DSP applications with very different performance and cost requirements. We also implemented a high-level-synthesis (HLS) intellectual property (IP) generator that can generate flexible FFT HLS-IP blocks that can be mapped to multiple micro-/macroarchitectures, to enable design space exploration as well as being used for functional simulation in the Simulink environment.

    EMVS: Embedded Multi Vector-core System

    With the increase in the density and performance of digital electronics, the demand for power-efficient high-performance computing (HPC) systems has increased for embedded applications. Existing embedded HPC systems suffer from issues with programmability, scalability, and portability. Therefore, a parameterizable and programmable high-performance processor system architecture is required to execute embedded HPC applications. In this work, we propose an Embedded Multi Vector-core System (EMVS), which executes an embedded application by managing multiple vectorized tasks and their memory operations. The system is designed and ported to an Altera DE4 FPGA development board. The performance of EMVS is compared with the Heterogeneous Multi-Processing Odroid XU3, Parallela and GPU Jetson TK1 embedded systems. Compared with these embedded systems, the results show that EMVS improves application and system performance by 19.28 and 10.22 times, respectively, and consumes 10.6 times less energy. Peer Reviewed. Postprint (author's final draft)

    Coarse-grained reconfigurable array architectures

    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.

    High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

    This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high-performance, efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely used Smith-Waterman pairwise sequence alignment algorithm, with general-purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on the performance-per-watt criterion and perform better than all other platforms on the performance-per-dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both the performance-per-watt and performance-per-dollar criteria. In general, in order to outperform other technologies on the performance-per-dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
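
    For readers unfamiliar with the Smith-Waterman algorithm benchmarked in this study, the following is a minimal software sketch of its scoring recurrence. It assumes a linear gap penalty and illustrative scoring values (match, mismatch and gap are hypothetical defaults); the paper's actual scoring scheme and hardware implementations are not described in the abstract.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between strings a and b.

    Uses the standard dynamic-programming recurrence with a linear gap
    penalty, keeping only two rows of the score matrix in memory.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

    Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel, which is the property that hardware accelerators typically exploit.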

    Embedded video stabilization system on field programmable gate array for unmanned aerial vehicle

    Unmanned Aerial Vehicles (UAVs) equipped with lightweight and low-cost cameras have grown in popularity and enable new applications of UAV technology. However, the video retrieved from small UAVs is normally of low quality due to high-frequency jitter. This thesis presents the development of a video stabilization algorithm implemented on a Field Programmable Gate Array (FPGA). The video stabilization algorithm consists of three main processes, namely motion estimation, motion stabilization and motion compensation, to minimize the jitter. Motion estimation involves block matching and Random Sample Consensus (RANSAC) to estimate the affine matrix that defines the motion perspective between two consecutive frames. Then, parameter extraction, motion smoothing and motion vector correction, which are parts of motion stabilization, are tasked with removing unwanted camera movement. Finally, motion compensation stabilizes two consecutive frames based on the filtered motion vectors. To facilitate ground station mobility, this algorithm needs to be processed onboard the UAV in real time. The parallel nature of video stabilization processing makes it well suited to an FPGA, which can provide real-time capability. The system is implemented on an Altera DE2-115 FPGA board. Fully dedicated hardware cores, without a Nios II processor, are designed in a stream-oriented architecture to accelerate the computation. Furthermore, a parallelized architecture consisting of block matching and highly parameterizable RANSAC processor modules shows that the proposed system is able to achieve processing of up to 30 frames per second and a good stabilization improvement of up to 1.78 in Interframe Transformation Fidelity value. Hence, it is concluded that the proposed system is suitable for real-time video stabilization for UAV applications.
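
    The RANSAC step in the motion estimation stage can be illustrated with a deliberately simplified model. The sketch below fits a pure translation (dx, dy) to block-matching motion vectors rather than the affine matrix used in the thesis; the function name, tolerance, iteration count and seed are all assumptions chosen for illustration.

```python
import random


def ransac_translation(vectors, tol=1.0, iterations=50, seed=0):
    """Estimate a dominant (dx, dy) from noisy motion vectors with RANSAC.

    Simplification: a translation-only model needs just one sample per
    hypothesis, whereas an affine model would need three point pairs.
    Expects a non-empty list of (dx, dy) tuples.
    """
    rng = random.Random(seed)              # seeded so runs are repeatable
    best_inliers = []
    for _ in range(iterations):
        dx, dy = rng.choice(vectors)       # minimal sample = hypothesis
        inliers = [(vx, vy) for vx, vy in vectors
                   if (vx - dx) ** 2 + (vy - dy) ** 2 <= tol ** 2]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers         # keep the largest consensus set
    n = len(best_inliers)
    mean_dx = sum(v[0] for v in best_inliers) / n
    mean_dy = sum(v[1] for v in best_inliers) / n
    return (mean_dx, mean_dy), n           # refit on inliers, plus inlier count
```

    Vectors produced by mismatched blocks fall outside the tolerance of the winning hypothesis and are excluded from the final estimate, which is the robustness property that motivates RANSAC in this pipeline.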

    Reconfigurable Logic Embedded Architecture of Support Vector Machine Linear Kernel

    Support Vector Machine (SVM) is a linear binary classifier that requires a kernel function to handle non-linear problems. Most previous SVM implementations for embedded systems in the literature were built targeting a specific application, and analyses were done through comparison with software implementations only. The impact of different application datasets on SVM hardware performance was not analyzed. In this work, we propose a parameterizable linear kernel architecture that is fully pipelined. It is prototyped and analyzed on an Altera Cyclone IV platform, and results are verified against an equivalent software model. Further analysis is done to determine the effect of the number of features and support vectors on the performance of the hardware architecture. In our proposed linear kernel implementation, the number of features determines the maximum operating frequency and the amount of logic resource utilization, whereas the number of support vectors determines the amount of on-chip memory usage and the throughput of the system.
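
    A software reference for a linear-kernel SVM classifier can be sketched as below, using the standard decision function f(x) = sum_i c_i <sv_i, x> + b, where each coefficient c_i combines the Lagrange multiplier and label of support vector sv_i. The function names and data layout are assumptions for illustration and do not reproduce the paper's software model.

```python
def linear_kernel(u, v):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, coeffs, bias):
    """Evaluate f(x) as one kernel evaluation per support vector, plus the bias."""
    total = 0.0
    for sv, c in zip(support_vectors, coeffs):
        total += c * linear_kernel(sv, x)
    return total + bias


def svm_classify(x, support_vectors, coeffs, bias):
    """Return +1 or -1 according to the sign of the decision function."""
    return 1 if svm_decision(x, support_vectors, coeffs, bias) >= 0 else -1
```

    In this form the number of features sets the width of each dot product, while the number of support vectors sets how many dot products are accumulated and how many vectors must be stored, which matches the two parameters the abstract analyzes.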