125 research outputs found

    Number theoretic techniques applied to algorithms and architectures for digital signal processing

    Many of the techniques for the computation of a two-dimensional convolution of a small fixed window with a picture are reviewed. It is demonstrated that Winograd's cyclic convolution and Fourier Transform Algorithms, together with Nussbaumer's two-dimensional cyclic convolution algorithms, have a common general form. Many of these algorithms use the theoretical minimum number of general multiplications. A novel implementation of these algorithms is proposed which is based upon one-bit systolic arrays. These systolic arrays are networks of identical cells, with each cell sharing a common control and timing function. Each cell is connected only to its nearest neighbours. These are all attractive features for implementation using Very Large Scale Integration (VLSI). The throughput rate is limited only by the time to perform a one-bit full addition. In order to assess the usefulness of these systolic arrays, a 'cost function' is developed to compare them with more conventional techniques, such as the Cooley-Tukey radix-2 Fast Fourier Transform (FFT). The cost function shows that these systolic arrays offer a good way of implementing the Discrete Fourier Transform for transforms up to about 30 points in length. The cost function is a general tool and allows comparisons to be made between different implementations of the same algorithm and between dissimilar algorithms. Finally, a technique is developed for the derivation of Discrete Cosine Transform (DCT) algorithms from the Winograd Fourier Transform Algorithm. These DCT algorithms may be implemented by modified versions of the systolic arrays proposed earlier, but requiring half the number of cells.
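The cyclic convolution at the centre of this abstract, and its link to the Discrete Fourier Transform, can be sketched as follows. This is a minimal illustration only, not Winograd's or Nussbaumer's algorithm: it uses a direct O(N^2) sum and a naive DFT, and the function names are assumptions made for this example.

```python
import cmath


def cyclic_convolution(x, h):
    """Direct N-point cyclic convolution: y[n] = sum_k x[k] * h[(n - k) mod N]."""
    n = len(x)
    if len(h) != n:
        raise ValueError("both sequences must have the same length")
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse applies the 1/N scaling."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def cyclic_convolution_via_dft(x, h):
    """Same result via the convolution theorem: pointwise product of the DFTs."""
    spectrum = [a * b for a, b in zip(dft(x), dft(h))]
    return dft(spectrum, inverse=True)


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    h = [1, 0, -1, 0]
    print(cyclic_convolution(x, h))
    print([round(v.real, 6) for v in cyclic_convolution_via_dft(x, h)])
```

The pointwise product in the transform domain is where the "general multiplications" counted in the abstract occur; fast algorithms aim to reduce the work surrounding that product.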

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    New FFT/IFFT Factorizations with Regular Interconnection Pattern Stage-to-Stage Subblocks

    FFT (Fast Fourier Transform) factorizations presenting a regular interconnection pattern between factors or stages are known as parallel algorithms, or Pease algorithms, since they were first proposed by Pease. In this paper, new FFT/IFFT (Inverse FFT) factorizations with blocks that exhibit the regular Pease interconnection pattern are derived. It is shown that these blocks can be obtained at a previously selected scale. The new factorizations for both the FFT and IFFT have two kinds of factors: a few Cooley-Tukey type factors and new factors providing the same Pease interconnection pattern property in blocks. For a given factorization, these blocks share dimensions and the stage-to-stage interconnection pattern, and all of them can be calculated independently from one another.
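To illustrate what a regular stage-to-stage interconnection pattern means, the following is a minimal sketch of a radix-2 constant-geometry FFT in the Pease style, in which every stage reads the pair (i, i + N/2) and writes to (2i, 2i + 1). It is not the new block factorization derived in the paper; the function names and the decimation-in-frequency arrangement are assumptions made for this example.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def constant_geometry_fft(x):
    """Radix-2 FFT in which every stage uses the same wiring.

    Butterfly i always reads positions i and i + N/2 and writes positions
    2i and 2i + 1. Only the twiddle factors change from stage to stage.
    """
    n = len(x)
    stages = n.bit_length() - 1
    if n < 1 or (1 << stages) != n:
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in x]
    half = n // 2
    for s in range(stages):
        nxt = [0j] * n
        for i in range(half):
            a, b = data[i], data[i + half]
            # Twiddle exponent for stage s: clear the low s bits of i.
            twiddle = cmath.exp(-2j * cmath.pi * ((i >> s) << s) / n)
            nxt[2 * i] = a + b
            nxt[2 * i + 1] = (a - b) * twiddle
        data = nxt
    # The stages leave the spectrum in bit-reversed order; undo that here.
    return [data[bit_reverse(k, stages)] for k in range(n)]


if __name__ == "__main__":
    print([round(abs(v), 6) for v in constant_geometry_fft([1, 0, 0, 0, 0, 0, 0, 0])])
```

Because the read and write addresses are identical in every stage, a hardware realization could reuse one wiring pattern for all stages, which is the property that makes such factorizations attractive for parallel implementation.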

    DFT algorithms for bit-serial GaAs array processor architectures

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology
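The idea behind a bit-serial arithmetic unit can be sketched with a simple adder that processes one bit per clock cycle through a single full adder. This is an illustrative model for unsigned integers only, not SPEC's floating-point design; the function names and the least-significant-bit-first ordering are assumptions made for this example.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def bit_serial_add(x, y, width):
    """Add two unsigned `width`-bit integers one bit per cycle, LSB first.

    Returns (result modulo 2**width, final carry). A single full adder and a
    one-bit carry register are reused on every cycle.
    """
    carry = 0
    result = 0
    for cycle in range(width):
        a = (x >> cycle) & 1
        b = (y >> cycle) & 1
        sum_bit, carry = full_adder(a, b, carry)
        result |= sum_bit << cycle
    return result, carry


if __name__ == "__main__":
    print(bit_serial_add(13, 7, 8))
```

The trade-off shown here is the general one for bit-serial designs: very little logic per unit, at the cost of one cycle per bit of the operand width.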

    Hardware Acceleration of Video analytics on FPGA using OpenCL

    With the exponential growth in video content over the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis applications use deep learning algorithms such as convolutional neural networks (CNNs) because of their high accuracy in object detection. Thus, enhancing the performance of CNN models becomes crucial for video analysis. CNN models are computationally expensive and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in an energy- and thermally-constrained environment such as traffic management, GPUs are less preferred because of their high power consumption and limited energy efficiency, and because they are difficult to fit in a small space. To enable real-time video analytics in emerging large-scale Internet of Things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion. Thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architectural adaptiveness, high computational throughput for streaming processing, and high energy efficiency. This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics on the edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which exploits both spatial and temporal parallelism for energy-efficient CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor that causes scalability issues in existing designs, and solutions are proposed to resolve the issue with compiler automation.
The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured with the proposed CNN kernel on an Intel Arria 10 GX1150 FPGA. The measurements show that the proposed CNN kernel, when mapped with 100% utilization of computation resources, achieves a latency of 11ms, 84ms, 1614.9ms, and 990.34ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet respectively, when the input feature maps and weights are represented using the 32-bit floating-point data type. Dissertation/Thesis, Masters Thesis, Electrical Engineering 201
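For a sense of scale, the reported latencies can be converted into approximate frame rates. The short sketch below assumes, purely for illustration, that frames are processed one at a time with no overlap between consecutive frames, so that throughput is the reciprocal of latency; the abstract does not state this, and the variable and function names are made up for this example.

```python
# Latencies in milliseconds, as reported in the abstract (32-bit floating point).
reported_latency_ms = {
    "Alexnet": 11.0,
    "Resnet-50": 84.0,
    "Retinanet": 1614.9,
    "Light-weight Retinanet": 990.34,
}


def frames_per_second(latency_ms):
    """Frames per second if each frame takes `latency_ms` and frames do not overlap."""
    return 1000.0 / latency_ms


# Assumed model: one frame in flight at a time (not stated in the source).
throughput = {name: frames_per_second(ms) for name, ms in reported_latency_ms.items()}

if __name__ == "__main__":
    for name, fps in throughput.items():
        print(f"{name}: {fps:.2f} frames/s")
```

Under that assumption, only the two smaller models would run at more than a few frames per second on this device.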