
    Multipliers for Floating-Point Double Precision and Beyond on FPGAs

    The implementation of high-precision floating-point applications on reconfigurable hardware requires a variety of large multipliers: standard multipliers are the core of floating-point multipliers, while truncated multipliers, which trade resources for a well-controlled accuracy degradation, are useful building blocks where a full multiplier is not needed. This work studies the automated generation of such multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs. The optimization is expressed as a tiling problem, where a tile represents a hardware multiplier and super-tiles are wirings of several hardware multipliers that make efficient use of the DSP-internal resources. The tiling technique is shown to adapt to both full and truncated multipliers. It addresses arbitrary precisions, including single and double precision, but also the quadruple precision introduced by the IEEE 754-2008 standard and currently unsupported by processor hardware. An open-source implementation is provided in the FloPoCo project.
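    To make the tiling idea concrete, the sketch below covers a large multiplication with rectangular DSP-sized sub-products and sums the shifted partial results; the 24x17 tile size, the naive row-by-row placement, and the absence of super-tiles are illustrative simplifications, not FloPoCo's actual optimization.

```python
# Illustrative sketch (not FloPoCo's generator): cover a large unsigned
# multiplication with DSP-sized "tiles". Each tile computes a sub-product
# of at most TILE_X x TILE_Y bits; shifted sub-products are summed to
# recover the full result. A 24x17 tile is assumed here as a stand-in for
# a typical FPGA DSP-block multiplier.
import random

TILE_X, TILE_Y = 24, 17

def tiled_multiply(a: int, b: int, wa: int, wb: int) -> int:
    """Multiply wa-bit a by wb-bit b using only TILE_X x TILE_Y sub-products."""
    result = 0
    for xa in range(0, wa, TILE_X):          # tile origin along a
        for xb in range(0, wb, TILE_Y):      # tile origin along b
            a_chunk = (a >> xa) & ((1 << TILE_X) - 1)
            b_chunk = (b >> xb) & ((1 << TILE_Y) - 1)
            result += (a_chunk * b_chunk) << (xa + xb)
    return result

# Quick check against Python's exact integers, e.g. a 53x53-bit
# double-precision mantissa product.
a, b = random.getrandbits(53), random.getrandbits(53)
assert tiled_multiply(a, b, 53, 53) == a * b
```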

    Resource Efficient Single Precision Floating Point Multiplier Using Karatsuba Algorithm

    In floating-point arithmetic, multiplication is the operation most frequently required by signal processing and scientific applications. For two single-precision floating-point numbers, a 24-bit mantissa multiplication is needed to obtain the final result, and this mantissa multiplication dominates the overall performance in terms of occupied area and propagation delay. This paper presents the design and analysis of single-precision floating-point multiplication using the Karatsuba algorithm with a Vedic multiplier, incorporating modified 2x1 multiplexers and modified 4:2 compressors to overcome the drawbacks of existing techniques. The performance of the single-precision floating-point multiplier is analyzed in terms of area and delay for the Karatsuba algorithm with existing techniques, such as 4x1 multiplexers and 3:2 compressors, and with the modified 2x1 multiplexers and 4:2 compressors. Simulation results show that single-precision floating-point multiplication with the Karatsuba algorithm using the modified 4:2 compressor with XOR-MUX logic provides better performance and more efficient use of area and delay than the existing techniques. All blocks involved in the floating-point multiplication are coded in Verilog and synthesized using the Xilinx ISE simulator.
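    As a reference for the splitting step, the sketch below applies one Karatsuba level to a 24-bit mantissa product, replacing the 24x24 multiplication by three 12x12 multiplications plus shifts and additions; the 12-bit split point is an illustrative choice, not a detail taken from the paper.

```python
# Minimal sketch of one Karatsuba level applied to a 24-bit mantissa
# product: three 12x12 multiplications replace the 24x24 multiplication.

def karatsuba24(a: int, b: int) -> int:
    """Multiply two 24-bit unsigned integers with three 12-bit products."""
    mask = (1 << 12) - 1
    a_hi, a_lo = a >> 12, a & mask
    b_hi, b_lo = b >> 12, b & mask
    p_hi = a_hi * b_hi                                    # high x high
    p_lo = a_lo * b_lo                                    # low x low
    p_mid = (a_hi + a_lo) * (b_hi + b_lo) - p_hi - p_lo   # cross terms
    return (p_hi << 24) + (p_mid << 12) + p_lo

# The mantissa of an IEEE 754 single-precision number is its 23-bit
# fraction with the implicit leading 1 restored.
m1, m2 = (1 << 23) | 0x2AAAAA, (1 << 23) | 0x155555
assert karatsuba24(m1, m2) == m1 * m2
```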

    DSP48E efficient floating point multiplier architectures on FPGA


    Single-Precision and Double-Precision Merged Floating-Point Multiplication and Addition Units on FPGA

    Floating-point (FP) operations defined in the IEEE 754-2008 Standard for Floating-Point Arithmetic provide a wider dynamic range and higher precision than fixed-point operations, and many scientific computations and multimedia applications adopt them. Among all FP operations, addition and multiplication are the most frequent. In this thesis, merged single-precision (SP) and double-precision (DP) FP multiplier and FP adder architectures are proposed. The proposed efficient iterative FP multiplier is designed around the Karatsuba algorithm and implemented with a pipelined architecture. It can complete two parallel SP multiplications in one iteration with a latency of 6 clock cycles, or one DP multiplication in two iterations with a latency of 9 clock cycles. Implemented on a Xilinx Virtex-5 (xc5vlx155ff1760-3) FPGA device, the proposed multiplier runs at 348 MHz using 6 DSP48E blocks, 1117 LUTs, and 1370 FFs. Compared to a previous FPGA-based multiple-precision FP multiplier, the proposed design runs at a 4% faster clock frequency with 33% fewer DSP blocks, 17% lower latency for SP multiplication, and 28% lower latency for DP multiplication. The proposed high-performance FP adder is designed around the two-path FP addition algorithm. With a fully pipelined architecture, it can complete one DP or two parallel SP addition/subtraction operations in 6 clock cycles. The adder architecture is implemented on both Altera and Xilinx 65 nm FPGA devices. On a Xilinx Virtex-5 (xc5vlx155ff1760-3) device it runs at up to 336 MHz with 1694 FFs and 1420 LUTs, an 11.3% faster clock frequency than the combination of one DP and two SP architectures built with the Xilinx FP operator. On an Altera Stratix-III (EP3SL340F1760C2) device, the maximum clock frequency reaches 358 MHz while occupying 1686 ALUTs and 1556 registers, 11.6% faster than the combination of one DP and two SP architectures built with the Altera FP megafunction. For the reference of other researchers, implementation results of the proposed FP multiplier and FP adder on the latest Xilinx Virtex-7 and Altera Arria 10 devices are also provided.
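    The two-path addition scheme mentioned above can be sketched as a small software model: operands whose exponents differ by at most one take the close path, where cancellation may require a large normalization shift, while all others take the far path, where a long alignment shift is followed by only a short normalization. The model below works on normalized integer significands and ignores signs, rounding, and special values; it illustrates the classic path split, not the thesis's hardware design.

```python
W = 24  # significand width (single precision with the hidden bit)

def two_path_sub(e_a: int, m_a: int, e_b: int, m_b: int):
    """Magnitude of A - B with A = m_a * 2**e_a, B = m_b * 2**e_b and
    m_a, m_b normalized W-bit integers. Returns a normalized (e_r, m_r);
    bits beyond W are truncated (rounding deliberately omitted)."""
    if (e_a, m_a) < (e_b, m_b):
        e_a, m_a, e_b, m_b = e_b, m_b, e_a, m_a        # ensure A >= B
    d = e_a - e_b                                      # alignment distance
    diff = (m_a << d) - m_b                            # result scaled by 2**e_b
    if diff == 0:
        return 0, 0
    if d <= 1:
        # Close path: cancellation can wipe out many leading bits, so a
        # leading-zero count drives a potentially large left shift.
        left = W - diff.bit_length()
        return e_b - left, diff << left if left >= 0 else diff >> -left
    # Far path: no massive cancellation, so the result needs only a short
    # right normalization shift (kept exact here; hardware would round).
    right = diff.bit_length() - W
    return e_b + right, diff >> right

# Far path example: exponents 5 apart.
e_r, m_r = two_path_sub(5, 0b101 << (W - 3), 0, 1 << (W - 1))
assert m_r * 2**e_r == (0b101 << (W - 3)) * 2**5 - (1 << (W - 1))

# Close path with massive cancellation: the result is a single ulp.
e_r, m_r = two_path_sub(0, 1 << (W - 1), -1, (1 << W) - 2)
assert m_r * 2.0**e_r == 1.0
```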

    Resource Optimal Truncated Multipliers for FPGAs

    This work presents the resource-optimal design of truncated multipliers targeting field-programmable gate arrays (FPGAs). In contrast to application-specific integrated circuits (ASICs), FPGA design poses distinct challenges because the partial products can be computed by many combinations of logic-based or DSP-based sub-multipliers. To tackle this, we extend a previously proposed tiling methodology that translates multiplier design into a geometrical problem: the target multiplier is represented by a board that has to be covered by tiles representing the sub-multipliers, and the tiling with the least resources can be found with integer linear programming (ILP). Our extension considers the error of possibly unoccupied positions on the board and determines the tiling with the least resources that respects a maximal allowed error bound. This error bound is chosen such that a faithfully rounded truncated multiplier is obtained. Compared to previous designs that use a fixed number of guard bits or optimize at the level of dot diagrams, this allows a much better use of sub-multipliers, resulting in significant area savings without sacrificing timing.
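    The faithful-rounding target can be illustrated with a toy, logic-only truncated multiplier: partial-product bits below a cut-off column are never computed, a half-ulp compensation constant is added, and the result is truncated to the output width. The widths, the number of guard columns, and the exhaustive check below are illustrative assumptions, far simpler than the ILP-based tiling of the paper.

```python
# Toy truncated multiplier with an explicit error budget.
W_IN, W_OUT, GUARD = 8, 8, 4                   # small widths for an exhaustive test
CUT = 2 * W_IN - W_OUT - GUARD                 # columns below CUT are dropped

def truncated_mult(a: int, b: int) -> int:
    acc = 1 << (2 * W_IN - W_OUT - 1)          # half-ulp compensation constant
    for i in range(W_IN):
        for j in range(W_IN):
            if i + j >= CUT and (a >> i) & 1 and (b >> j) & 1:
                acc += 1 << (i + j)            # kept partial-product bits only
    return acc >> (2 * W_IN - W_OUT)           # keep the top W_OUT bits

# Faithful rounding: the rescaled result differs from the exact product by
# strictly less than one unit in the last place of the output.
ulp = 1 << (2 * W_IN - W_OUT)
assert all(abs(truncated_mult(a, b) * ulp - a * b) < ulp
           for a in range(1 << W_IN) for b in range(1 << W_IN))
```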

    Computing SpMV on FPGAs

    There are hundreds of papers on accelerating sparse matrix-vector multiplication (SpMV), yet only a handful target FPGAs, and some claim that FPGAs inherently perform worse than CPUs and GPUs. FPGAs do perform worse for some applications, such as matrix-matrix and matrix-vector multiplication, where CPUs and GPUs have too much memory bandwidth and floating-point compute power for FPGAs to compete. However, the low ratio of computation to memory operations and the irregular memory accesses of SpMV trip up both CPUs and GPUs; we see this as a leveling of the playing field for FPGAs. Our implementation rests on three pillars: matrix traversal, multiply-accumulator design, and matrix compression. First, while most SpMV implementations traverse the matrix in row-major order, we mix column and row traversal. Second, to accommodate the new traversal, the multiply-accumulator stores many intermediate y values. Third, we compress the matrix to increase the transfer rate of the matrix from RAM to the FPGA. Together these pillars enable our SpMV implementation to perform competitively with CPUs and GPUs.
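    A minimal software sketch of the mixed traversal is given below: the matrix is processed in vertical strips of columns, nonzeros inside each strip are streamed in row order, and partial y values accumulate in a local array, mimicking the many intermediate y values held by the multiply-accumulator. The coordinate storage format and the strip width are illustrative assumptions, not the paper's hardware design.

```python
STRIP = 2  # number of columns per strip (illustrative)

def spmv_striped(rows, cols, vals, x, n_rows, n_cols):
    """y = A @ x for a sparse A given in coordinate (row, col, value) form."""
    y = [0.0] * n_rows
    entries = sorted(zip(rows, cols, vals))       # row-major order within...
    for c0 in range(0, n_cols, STRIP):            # ...each column strip
        for r, c, v in entries:
            if c0 <= c < c0 + STRIP:
                y[r] += v * x[c]                  # accumulate partial y values
    return y

# 3x4 example matrix: [[1, 0, 2, 0], [0, 3, 0, 0], [4, 0, 0, 5]]
rows, cols, vals = [0, 0, 1, 2, 2], [0, 2, 1, 0, 3], [1.0, 2.0, 3.0, 4.0, 5.0]
assert spmv_striped(rows, cols, vals, [1.0, 1.0, 1.0, 1.0], 3, 4) == [3.0, 3.0, 9.0]
```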

    Opérateurs et engins de calcul en virgule flottante et leur application à la simulation en temps réel sur FPGA (Floating-Point Operators and Calculation Engines and Their Application to Real-Time Simulation on FPGA)

    The real-time simulation of electrical networks has attracted vivid industrial interest in recent years, motivated by the substantial reduction in development costs that such a prototyping approach can offer: hardware under development can be progressively integrated into the simulation loop and tested under realistic conditions. However, CPU-based real-time simulation suffers from certain limitations, notably the difficulty of reaching time steps of a few microseconds, an important requirement for the faithful simulation of the fast transients of modern power converters. Industrial practitioners have therefore adopted the FPGA as the platform of choice for calculation engines dedicated to the fast real-time simulation of electrical networks, breaking the 5 kHz switching-frequency barrier characteristic of CPU-based simulation. FPGA-based simulation offers further advantages, including a reduced latency of the hardware-in-the-loop simulation, since the FPGA gives direct access to the sensors and actuators of the device being prototyped. The fixed-point format is paradigmatic of FPGA-based digital signal processing, but it penalizes development time because the designer must assess the precision required to represent every variable of the mathematical model; this has motivated an important research effort on the use of floating-point arithmetic for network simulation on FPGAs. The main difficulty of floating-point arithmetic is the long latency of its elementary operators, which is particularly penalizing when an adder must serve as an accumulator, a key building block of integration rules such as the trapezoidal or backward-Euler methods for which a single-cycle accumulator is crucial. Single-cycle floating-point addition and accumulation therefore form the core of this research work.

    The thesis proposes a new summation algorithm that generalizes the so-called self-alignment technique, with a simpler formulation and hardware implementation, and establishes criteria, demonstrated both theoretically and empirically, that guarantee the accuracy of the results. It also provides a comprehensive analysis of the redundant high-radix carry-save (HRCS) format for the fast addition of wide mantissas, along with two new HRCS operators: an endomorphic adder and an HRCS-to-conventional converter. Combining the self-alignment technique with the HRCS format yields single-cycle floating-point accumulators, multiply-accumulators (MACs), and dot-product operators, which are then used to build SIMD (single instruction, multiple data) calculation engines with several parallel MACs or dot-product operators. These operators have very short latencies, allowing time steps of a few hundred nanoseconds in the simulation of power converters of medium complexity. The document finally discusses the modelling of power-electronic circuits and concludes with a versatile calculation engine capable of simulating power converters with arbitrary topologies and up to 24 switches, achieving time steps below 1 μs and supporting switching frequencies in the range of tens of kilohertz.
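    The self-alignment idea lends itself to a compact software model, shown below under illustrative assumptions: every incoming term is shifted once, on arrival, onto a common fixed-point scale derived from a known bound on the exponents, so the running sum becomes an exact integer addition that maps naturally onto a single-cycle hardware accumulator. The widths, the exponent bound, and the final conversion back to floating point are assumptions of this sketch, not parameters of the thesis.

```python
import math

E_MAX = 32        # assumed bound on the exponents of the incoming terms
FRAC_BITS = 80    # fixed-point fraction bits kept below weight 2**0

def to_fixed(x: float) -> int:
    """Self-align: one shift per operand onto the common scale 2**-FRAC_BITS."""
    m, e = math.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    assert e <= E_MAX, "operand outside the assumed dynamic range"
    sig = round(m * (1 << 53))              # 53-bit integer significand
    shift = e - 53 + FRAC_BITS              # position of this term's LSB
    return sig << shift if shift >= 0 else sig >> -shift

def accumulate(terms):
    acc = 0                                  # wide two's-complement register
    for t in terms:
        acc += to_fixed(t)                   # one exact addition per cycle
    return acc / (1 << FRAC_BITS)            # convert back only at the end

vals = [1e6, 3.25, -1e6, 0.125]              # catastrophic-cancellation case
assert accumulate(vals) == 3.375             # exact, order-independent result
```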