11 research outputs found

    Residue Number System Reconfigurable Datapath

    In this paper we describe a possible approach to implementing a reconfigurable datapath for digital signal processing. The datapath should be programmable in terms of dynamic range and of the type and sequence of operations. We chose to implement it in the Residue Number System (RNS) because the RNS offers high speed and low power dissipation. Results show that, on the same set of applications, the RNS reconfigurable datapath offers better performance and lower power dissipation than a traditional FIR filter with the same characteristics.
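As a minimal sketch of why RNS arithmetic lends itself to fast datapaths: a number is held as its residues modulo a set of pairwise coprime moduli, so addition and multiplication run independently in each residue channel with no carries between channels. The moduli (7, 11, 13) and all function names below are illustrative assumptions, not taken from the paper.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; the dynamic range is 7 * 11 * 13 = 1001.
MODULI = (7, 11, 13)

def to_rns(x, moduli=MODULI):
    """Encode a non-negative integer as one residue per modulus."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: each residue is updated independently."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication, again with no cross-channel carries."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """Decode with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M
```

Results are only meaningful while they stay below the product of the moduli, which is why the dynamic range is one of the parameters a reconfigurable RNS datapath has to expose.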

    Embedded Co-Processor Architecture for CMOS Based Image Acquisition

    This paper describes a new co-processor architecture designed for CMOS sensor imaging. The co-processor unit is integrated into the image acquisition loop to exploit the full potential of CMOS selective-access imaging technology. Its processing features are tailored to the specific acquisition process of CMOS sensors (random region acquisition, variable image size, variable line-based or region-based acquisition modes, multi-exposure images). Moreover, although built from pipelined or parallel hardware processing modules, the co-processor can be configured on the fly, in terms of the type and number of chained processing steps, during the image acquisition process defined by the application. Simulated performance results, based on an FPGA implementation, are reported and compared with those of classical image acquisition systems based on PC platforms.

    Recurrently Decomposable 2-D Convolvers for FPGA-Based Digital Image Processing


    Smart camera with embedded co-processor: a postal sorting application

    This work describes an image acquisition and processing system based on a new co-processor architecture designed for CMOS sensor imaging. The platform allows a wide variety of acquisition modes to be configured (random region acquisition, variable image size, multi-exposure images), as well as high-performance image pre-processing (filtering, de-noising, binarisation, pattern recognition). Furthermore, an FPGA drives both the acquisition and a processing stage, which is followed by a Nexperia processor. The data transfer from the FPGA board to the Nexperia processor can be pipelined to the co-processor to increase the achievable throughput. The co-processor architecture has been designed so that the unit can be configured on the fly, in terms of the type and number of chained processing steps (up to 8 successive predefined pre-processing steps), during the image acquisition process that is dynamically defined by the application. Examples of acquisition and processing performance are reported and compared with classical image acquisition systems based on standard modular PC platforms. The experimental results show a considerable increase in performance. For instance, bar-code reading for postal sorting on a PC platform is limited to about 15 images (letters) per second. The new platform, besides being more compact and easier to install in hostile environments, can successfully analyze up to 50 images/s.

    Reconfigurable pipelined 2-D convolvers for fast digital signal processing


    Conception de processeurs spécialisés pour le traitement vidéo en temps réel par filtre local

    This master's thesis explores the possibilities offered by Application-Specific Instruction-Set Processors (ASIPs) for digital video applications, more specifically for a particular algorithm class used for video processing: local neighbourhood functions. For this algorithm class, an architectural exploration led to the identification of a set of design techniques which, together, form a coherent and systematic approach for the design of high-performance ASIPs usable for real-time video processing. The proposed design approach aims at an efficient utilization of the available bandwidth to memory, which constitutes the main performance bottleneck of the application. It is possible to approach the processing speed limit imposed by this bottleneck through an appropriate data reuse strategy and by exploiting the data parallelism inherent to the target algorithm class. The design approach comprises four steps: first, a Single Instruction Multiple Data (SIMD) instruction which calculates more than one output pixel in parallel is created. Then, shift registers, which are used for intra-line input pixel reuse, are added. Next, a processing pipeline is created by splitting the parallel instruction and adding application-specific registers for intermediate results. Finally, the custom load/store instructions are created. Some of these steps lead to possible hardware simplifications for some algorithms of the target class. The hardware structure thus obtained, together with the instruction-level parallelism made possible through the use of a Very Long Instruction Word (VLIW) architecture, mimics a pipelined systolic array. In order to demonstrate the validity of the proposed design approach experimentally, seven ASIPs have been designed by extending the instruction set of a configurable and extensible processor. Three of the ASIPs implement intra-field deinterlacing algorithms, and four implement 2D convolution with different kernel sizes. The results show a significant improvement in performance. For the intra-field deinterlacing algorithms, speedup factors are between 95 and 1330, while the factors of improvement of the Area-Time (AT) product are between 29 and 243, all compared to a pure software implementation running on a general-purpose processor. In the case of the two-dimensional convolution, speedup factors are between 36 and 80, while factors of improvement of the AT product are between 12 and 22. In all cases, real-time processing of high-definition video in the 1080i (deinterlacing) or 1080p (convolution) format is possible given a 130 nm manufacturing process.
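The intra-line reuse step can be illustrated in software. The sketch below is an assumption-laden model, not the thesis's hardware: it keeps a 3x3 window in a small shift register, loads one new pixel per line at each step, and counts memory loads so the saving over re-reading the full window is visible. The function name and the 3x3 window size are hypothetical, and the filter is applied without flipping the kernel.

```python
def filter_line_with_shift_register(rows, kernel):
    """Slide a 3x3 window along three input lines.

    rows   : three lists of pixels with equal length
    kernel : 3x3 list of weights (applied without flipping)
    Returns (outputs, loads), where loads counts pixels fetched from the lines.
    """
    width = len(rows[0])
    window = [[0, 0, 0] for _ in range(3)]  # one 3-stage shift register per line
    outputs = []
    loads = 0
    for x in range(width):
        for r in range(3):
            # Shift left by one and load a single new pixel for this line.
            window[r] = window[r][1:] + [rows[r][x]]
            loads += 1
        if x >= 2:  # the window is full once three columns have been shifted in
            outputs.append(sum(kernel[r][c] * window[r][c]
                               for r in range(3) for c in range(3)))
    return outputs, loads
```

For a line of width W this fetches 3W pixels, whereas reloading the whole window for each of the W - 2 outputs would fetch 9(W - 2), which is the kind of bandwidth saving the data reuse strategy targets.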

    Power-Aware Design Methodologies for FPGA-Based Implementation of Video Processing Systems

    The increasing capacity and capabilities of FPGA devices in recent years provide an attractive option for performance-hungry applications in the image and video processing domain. FPGA devices are often used as implementation platforms for real-time image and video processing algorithms because their programmable structure can exploit inherent spatial and temporal parallelism. While performance and area remain the two main design criteria, power consumption has become an important design goal, especially for mobile devices. Power consumption can be reduced by lowering the supply voltage, capacitances, clock frequency and switching activities in a circuit. Switching activities can be reduced by architectural optimization of processing cores such as adders, multipliers and multiply-accumulate units (MACs). This dissertation research focuses on reducing the switching activities in digital circuits by considering data dependencies in bit-level, word-level and block-level neighborhoods in a video frame. The bit-level data neighborhood dependency consideration for power reduction is illustrated in the design of pipelined array, Booth and log-based multipliers. For an array multiplier, the operands are partitioned into higher and lower parts so that the probability of the higher-order parts being zero or one increases. The gating technique for the pipelined approach deactivates part(s) of the multiplier when these special values are detected. For the Booth multiplier, the partitioning and gating technique is integrated into the Booth recoding scheme. In addition, a delay correction strategy is developed for the Booth multiplier to reduce the switching activities of the sign extension part in the partial products. A novel architecture design for the computation of log and inverse-log functions for the reduction of power consumption in arithmetic circuits is also presented. This also utilizes the proposed partitioning and gating technique for further dynamic power reduction by reducing the switching activities. The word-level and block-level data dependencies for reducing the dynamic power consumption are illustrated by presenting the design of a 2-D convolution architecture. Here the similarities of the neighboring pixels in window-based operations of image and video processing algorithms are considered for reduced switching activities. A partitioning and detection mechanism is developed to deactivate the parallel architecture for window-based operations if the higher-order parts of the pixel values are the same. A neighborhood dependent approach (NDA) is incorporated with different window buffering schemes. The symmetry property of filter kernels is also exploited with the NDA method for further reduction of switching activities. The proposed design methodologies are implemented and evaluated in an FPGA environment. It is observed that the dynamic power consumption in FPGA-based circuit implementations is significantly reduced in bit-level, data-level and block-level architectures when compared to state-of-the-art design techniques. A specific application for the design of a real-time video processing system incorporating the proposed design methodologies for low power consumption is also presented. An image enhancement application is considered, and the proposed partitioning and gating and NDA methods are utilized in the design of the enhancement system. Experimental results show that the proposed multi-level power-aware methodology achieves considerable power reduction. Research work is progressing in utilizing the data dependencies in subsequent frames in a video stream for the reduction of circuit switching activities and thereby the dynamic power consumption.
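A software analogy for the partitioning and gating idea, under stated assumptions: an unsigned operand is split into a high and a low half, and the high partial product is skipped when the high half is zero. Only the zero case is modelled here; the handling of the "one" case, the 8-bit width and the function name are all hypothetical, and the dissertation's actual hardware gating is not reproduced.

```python
def gated_multiply(a, b, width=8):
    """Multiply unsigned a (at most `width` bits) by b.

    The operand a is partitioned into high and low halves. The high partial
    product is computed only when the high half is non-zero, which stands in
    for deactivating that part of a hardware multiplier.
    Returns (product, active_parts).
    """
    half = width // 2
    a_hi = a >> half
    a_lo = a & ((1 << half) - 1)

    product = a_lo * b          # the low partial product is always needed
    active_parts = 1
    if a_hi != 0:               # gate: skip the high part for small operands
        product += (a_hi * b) << half
        active_parts += 1
    return product, active_parts
```

When neighboring pixel values are small or similar, operands such as differences tend to have zero high halves, so the second partial product stays idle more often; that is the intuition behind the reduced switching activity.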

    Efficient FPGA Architectures for Separable Filters and Logarithmic Multipliers and Automation of Fish Feature Extraction Using Gabor Filters

    Convolution and multiplication operations in the filtering process can be optimized by minimizing resource utilization using Field Programmable Gate Arrays (FPGAs) and separable filter kernels. An FPGA architecture for separable convolution is proposed to reduce on-chip resource utilization and external memory bandwidth for a given processing rate of the convolution unit. Multiplication in the integer number system can be optimized in terms of resources, operation time and power consumption by converting to the logarithmic domain. To achieve this, a method that alters the filter weights is proposed and implemented for error reduction. The results show significant error reduction compared to existing methods, thereby optimizing the multiplication in terms of the above-mentioned metrics. Underwater video and still images are used by many programs within National Oceanic and Atmospheric Administration (NOAA) Fisheries with the objective of identifying, classifying and quantifying living marine resources. These programs use underwater cameras to record video data for manual analysis, which is labour intensive, time consuming and error prone. An efficient solution to this problem is proposed which uses Gabor filters for feature extraction. The proposed method is implemented to identify two species of fish, namely Epinephelus morio and Ocyurus chrysurus. The results show a higher rate of detection with a minimal rate of false alarms.
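To make the separable-kernel saving concrete, here is a small sketch (plain Python, hypothetical function names, kernels applied without flipping) showing that a 2-D kernel formed as the outer product of a column vector and a row vector gives the same result as two 1-D passes. For a k x k kernel this needs 2k multiplications per output pixel instead of k squared.

```python
def filter_rows(image, taps):
    """Apply a 1-D filter along each row (valid region only)."""
    k = len(taps)
    return [[sum(taps[i] * row[x + i] for i in range(k))
             for x in range(len(row) - k + 1)]
            for row in image]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def separable_filter(image, col_taps, row_taps):
    """Horizontal pass with row_taps, then vertical pass with col_taps."""
    horizontal = filter_rows(image, row_taps)
    return transpose(filter_rows(transpose(horizontal), col_taps))

def direct_filter(image, kernel):
    """Reference 2-D filter using the full kernel (valid region only)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[j][i] * image[y + j][x + i]
                 for j in range(kh) for i in range(kw))
             for x in range(len(image[0]) - kw + 1)]
            for y in range(len(image) - kh + 1)]
```

A quick check is to build the full kernel as `[[c * r for r in row_taps] for c in col_taps]` and compare `direct_filter` with `separable_filter` on the same image; with integer inputs the two results match exactly.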

    Hardware Acceleration of Deep Convolutional Neural Networks on FPGA

    The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks, with significantly improved accuracy. During the inference phase, many applications demand low-latency processing of a single image under strict power consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms and creates opportunities for dedicated acceleration hardware, e.g. FPGAs, whose digital circuits can be customized for deep learning inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility. As convolution accounts for most of the operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture for CNN acceleration are proposed to minimize data communication while maximizing resource utilization to achieve high performance. Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant effort and expertise are required, leading to long development times, which makes it difficult to keep up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model the different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topologies, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs, achieving state-of-the-art performance. Given the optimized acceleration strategy, many design options still remain, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and ultimately affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
    Dissertation/Thesis: Doctoral Dissertation, Electrical Engineering 201
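The "four levels of loops" can be written out as a reference loop nest. The grouping below (kernel window, input feature maps, spatial scan, output feature maps) is one common way to count the levels and is an assumption here, as are the function name, the square kernel, stride 1 and the absence of padding; it is a plain software reference, not the accelerator's dataflow.

```python
def conv_layer(inputs, weights):
    """Reference convolution layer.

    inputs  : inputs[c_in][y][x]
    weights : weights[c_out][c_in][ky][kx], square kernel, stride 1, no padding
    Returns (outputs, macs), where macs counts multiply-accumulate operations.
    """
    n_out, n_in = len(weights), len(inputs)
    k = len(weights[0][0])
    out_h = len(inputs[0]) - k + 1
    out_w = len(inputs[0][0]) - k + 1
    outputs = [[[0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    macs = 0
    for co in range(n_out):                  # level 4: output feature maps
        for y in range(out_h):               # level 3: scan over the output map
            for x in range(out_w):
                for ci in range(n_in):       # level 2: input feature maps
                    for ky in range(k):      # level 1: kernel window
                        for kx in range(k):
                            outputs[co][y][x] += (weights[co][ci][ky][kx]
                                                  * inputs[ci][y + ky][x + kx])
                            macs += 1
    return outputs, macs
```

Choices such as which loops to unroll, tile or reorder decide how often each input pixel and weight is fetched, which is why the loop optimization has to be studied before the hardware is fixed.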

    Estimation par analyse statique de la bande-passante d'accélérateurs en synthèse de haut niveau sur FPGA

    FPGA acceleration of portions of code otherwise executed on a general-purpose processor is a well-known and frequently used solution for speeding up the execution of complex and data-heavy algorithms. This has been the case for around two decades in embedded systems, where power constraints limit the usefulness of inefficient general-purpose solutions. However, with the end of Dennard scaling and Moore’s law, FPGA acceleration is also increasingly used in datacenters, where traditional CPU and GPGPU approaches are limited by the ever-increasing current consumption required by many modern applications such as big data and machine learning. Furthermore, the design of FPGA coprocessors, while still more complex than writing software, is facilitated by the recent democratization of High-Level Synthesis (HLS) tools, which allow the automated translation of high-level software code (generally a static subset of C/C++) into an equivalent synthesizable hardware description (VHDL/Verilog). While it is still generally necessary to modify the high-level code in order to produce good results, HLS tools usually ship with a fast performance estimator of the resulting micro-architecture, allowing for fast iterative development methodologies. However, while FPGAs have great potential for parallelism and concurrency, in practice they are often limited by memory bandwidth and/or by the communication latency between the coprocessor and the general-purpose CPU controlling it. In addition, estimating this memory bandwidth is much more complex than it appears at first glance, since it depends on the size of the data transfer, the order of the accesses, the number of simultaneous accesses to memory, the width of the accessed data, the clock speed of both the FPGA and the memory, etc. This bandwidth also differs from one memory controller configuration to another, and everything becomes more complex with SoC-FPGAs (SoCs including a hard processor and programmable logic), since they contain multiple different datapaths between the programmable logic and the hard memory controller. Finally, the achievable bandwidth is almost always lower than the maximum theoretical bandwidth given in the manufacturer’s documentation. Thus, while existing HLS tools can easily estimate the coprocessor’s performance if it is isolated from the rest of the system, they do not take into account how this performance is affected by the achievable memory bandwidth. This makes simulation of the whole system, or its synthesis followed by execution, the only trustworthy ways to get a good performance estimate. However, while the HLS tool’s performance estimation takes a few seconds, simulation or synthesis takes tens of minutes, which considerably slows down iterative development flows. This added delay increases time-to-market and can lead to suboptimal solutions for lack of development time.
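The gap between an isolated-coprocessor estimate and a memory-aware one can be seen with a first-order model. The sketch below is not the estimator developed in the thesis; it is a simple bound that takes the larger of compute time and transfer time, and every name and number in it (including the bandwidth efficiency factor) is a hypothetical placeholder.

```python
def estimate_runtime_s(compute_cycles, clock_hz, bytes_moved,
                       peak_bw_bytes_per_s, bw_efficiency):
    """First-order runtime bound for an accelerator.

    compute_cycles      : cycles reported for the isolated coprocessor
    clock_hz            : accelerator clock frequency
    bytes_moved         : total bytes read from and written to external memory
    peak_bw_bytes_per_s : theoretical maximum bandwidth from the datasheet
    bw_efficiency       : assumed fraction of the peak actually achievable (0..1)
    Returns (runtime_seconds, limiting_factor).
    """
    compute_s = compute_cycles / clock_hz
    memory_s = bytes_moved / (peak_bw_bytes_per_s * bw_efficiency)
    limiting = "memory" if memory_s > compute_s else "compute"
    return max(compute_s, memory_s), limiting
```

With, say, 1,000,000 cycles at 100 MHz, 64 MB of traffic, a 3.2 GB/s peak and 50% efficiency, the compute-only estimate is 10 ms while the memory side needs about 40 ms, so the isolated figure would be off by roughly a factor of four.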