50 research outputs found

    Heuristic datapath allocation for multiple wordlength systems


    Optimized Architectural Synthesis of Fixed-Point Datapaths

    In this paper we address the time-constrained architectural synthesis of fixed-point DSP algorithms using FPGA devices. Optimized fixed-point implementations are obtained by considering: (i) a multiple wordlength approach; (ii) a complete datapath formed of wordlength-wise resources (i.e. functional units, multiplexers and registers); and (iii) a novel resource usage metric that enables the wise distribution of logic fabric and embedded DSP resources. The paper shows: (i) the benefits of applying a multiple wordlength approach to the implementation of fixed-point datapaths; and (ii) the benefits of a wise use of embedded FPGA resources. The proposed metric enables area improvements of up to 54%, and the use of a complete fixed-point datapath leads to improvements of up to 35%.

    Mapping for maximum performance on FPGA DSP blocks

    The digital signal processing (DSP) blocks on modern field programmable gate arrays (FPGAs) are highly capable and support a variety of different datapath configurations. Unfortunately, inference in synthesis tools can fail to result in circuits that reach maximum DSP block throughput. We have developed a tool that maps graphs of add/sub/mult nodes to DSP blocks on Xilinx FPGAs, ensuring maximum throughput. This is done by delaying scheduling until after the graph has been partitioned onto DSP blocks and scheduled based on their pipeline structure, resulting in a throughput-optimized implementation. Our tool prepares equivalent implementations using a variety of other methods, including high-level synthesis (HLS), for comparison. We show that the proposed approach offers an improvement in frequency of 100% over standard pipelined code, and 23% over a Vivado HLS synthesis implementation, while retaining code portability, at the cost of a modest increase in logic resource usage.

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Design and application of reconfigurable circuits and systems


    Wordlength optimization for linear digital signal processing


    Multiobjective Optimization for Reconfigurable Implementation of Medical Image Registration

    In real-time signal processing, a single application often has multiple computationally intensive kernels that can benefit from acceleration using custom or reconfigurable hardware platforms, such as field-programmable gate arrays (FPGAs). For adaptive utilization of resources at run time, FPGAs with capabilities for dynamic reconfiguration are emerging. In this context, it is useful for designers to derive sets of efficient configurations that trade off application performance with fabric resources. Such sets can be maintained at run time so that the best available design tradeoff is used. Finding a single, optimized configuration is difficult, and generating a family of optimized configurations suitable for different run-time scenarios is even more challenging. We present a novel multiobjective wordlength optimization strategy developed through FPGA-based implementation of a representative computationally intensive image processing application: medical image registration. Tradeoffs between FPGA resources and implementation accuracy are explored, and Pareto-optimized wordlength configurations are systematically identified. We also compare search methods for finding Pareto-optimized design configurations and demonstrate the applicability of search based on evolutionary techniques for identifying superior multiobjective tradeoff curves. We demonstrate the feasibility of this approach in the context of FPGA-based medical image registration; however, it may be adapted to a wide range of signal processing applications.
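The idea of a Pareto-optimized set of wordlength configurations can be illustrated with a minimal sketch. It assumes each configuration has already been reduced to a hypothetical (resource cost, error) pair, both to be minimized. The function name `pareto_front` and the sample numbers are illustrative only; they are not taken from the paper, and the sketch does not reproduce the evolutionary search the abstract refers to.

```python
from typing import List, Tuple

# Each candidate is (resource_cost, error); both are to be minimized.
# These pairs are hypothetical stand-ins for evaluated wordlength configurations.
Candidate = Tuple[float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the non-dominated candidates, sorted by increasing cost."""
    front: List[Candidate] = []
    best_error = float("inf")
    # After sorting by (cost, error), a point is non-dominated exactly when
    # its error is strictly lower than that of every cheaper-or-equal point.
    for cost, error in sorted(set(candidates)):
        if error < best_error:
            front.append((cost, error))
            best_error = error
    return front


if __name__ == "__main__":
    configs = [(100, 0.50), (120, 0.20), (120, 0.30),
               (150, 0.20), (200, 0.05), (90, 0.90)]
    for cost, error in pareto_front(configs):
        print(f"cost={cost}, error={error}")
```

A run-time system of the kind described could keep only the points returned by such a filter and pick the lowest-error one that fits the resources currently available.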


    Accuracy-Guaranteed Fixed-Point Optimization in Hardware Synthesis and Processor Customization

    Fixed-point arithmetic is broadly preferred to floating-point in hardware development due to the reduced hardware complexity of fixed-point circuits. In hardware implementation, the bitwidth allocated to the data elements has a significant impact on efficiency metrics for the circuits, including area usage, speed and power consumption. Fixed-point word-length optimization (WLO) is a well-known research area. It aims to optimize fixed-point computational circuits through the adjustment of the allocated bitwidths of their internal and output signals. A fixed-point number is composed of an integer part and a fractional part. There is a minimum number of bits for the integer part that guarantees overflow and underflow avoidance in each signal. This value depends on the range of values that the signal may take. The fractional word-length determines the amount of finite-precision error that is introduced in the computations. There is a trade-off between accuracy and hardware cost in fractional word-length selection. The process of allocating the fractional word-length requires two important procedures: finite-precision error modeling and fractional word-length selection. Existing works on WLO have focused on hardwired circuits as the target implementation platform. In this thesis, we introduce new methodologies, techniques and algorithms to improve the hardware realization of fixed-point computations in hardwired circuits and customizable processors. The thesis proposes an enhanced error modeling approach based on affine arithmetic that addresses some shortcomings of the existing methods and improves their accuracy. The thesis also introduces an acceleration technique and two semi-analytical fractional bitwidth selection algorithms for WLO in hardwired circuit design.
While the first algorithm follows a progressive search strategy, the second one uses a tree-shaped search method for fractional width optimization. The algorithms offer two different time-complexity/cost efficiency trade-off options. The first algorithm has polynomial complexity and achieves comparable results with existing heuristic approaches. The second algorithm has exponential complexity but achieves near-optimal results compared to an exhaustive search. The thesis further proposes a method to combine word-length optimization with application-specific processor customization. The supported datatype word-length, the size of register-files and the architecture of the functional units are the main target objectives to be optimized. A new optimization algorithm is developed to find the best combination of word-length and other customizable parameters in the proposed method. Accuracy requirements, defined as the worst-case error bound, are the key consideration that must be met by any solution. To facilitate evaluation and implementation of the selected solutions, a new processor design environment was developed. This environment, which is called PolyCuSP, supports necessary customization flexibility to realize and evaluate the solutions given by the optimization algorithm. PolyCuSP supports rapid design space exploration and capability to model different instruction-set architectures to enable effective compari
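The two roles described in this abstract, an integer part sized from the signal range and a fractional part that sets the finite-precision error, can be made concrete with a small sketch. It assumes a signed two's-complement format and round-to-nearest quantization. The helper names `integer_bits`, `quantize` and `max_rounding_error` are hypothetical, and this is a plain rounding bound, not the affine-arithmetic error model the thesis proposes.

```python
import math


def integer_bits(lo: float, hi: float) -> int:
    """Smallest integer-part width (sign bit included) such that the
    range [lo, hi] fits inside [-2**(bits-1), 2**(bits-1))."""
    bits = 1  # a lone sign bit covers [-1, 1)
    while lo < -(2 ** (bits - 1)) or hi >= 2 ** (bits - 1):
        bits += 1
    return bits


def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 2 ** frac_bits
    return math.floor(x * scale + 0.5) / scale


def max_rounding_error(frac_bits: int) -> float:
    """Worst-case error of a single round-to-nearest step: half an LSB."""
    return 2.0 ** -(frac_bits + 1)


if __name__ == "__main__":
    # A signal known to stay within [-3.2, 5.7] needs 4 integer bits.
    print(integer_bits(-3.2, 5.7))
    # Each extra fractional bit halves the worst-case rounding error.
    for f in (4, 8, 12):
        q = quantize(0.7, f)
        print(f, q, abs(q - 0.7), max_rounding_error(f))
```

Under these assumptions the integer width is fixed by the range alone, while the fractional width is the free parameter that a word-length optimizer trades against hardware cost.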