Search CORE

117 research outputs found

Multi-operand Decimal Adder Trees for FPGAs

Author: de Dinechin Florent
Vazquez Alvaro
Publication venue: HAL CCSD
Publication date: 14/10/2010
Field of study

The research and development of hardware designs for decimal arithmetic is currently going under an intense activity. For most part, the methods proposed to implement fixed and floating point operations are intended for ASIC designs. Thus, a direct mapping or adaptation of these techniques into a FPGA could be far from an optimal solution. Only a few studies have considered new methods more suitable for FPGA implementations. A basic operation that has not received enough attention in this context is multi-operand BCD addition. For example, it is of interest for low latency implementations of decimal fixed and floating point multipliers and decimal fused multiply-add units. We have explored the most representative proposals for multi-operand BCD addition and found that the resultant implementations in FPGAs are still very inefficient in terms of both area and latency when compared to their binary counterparts. In this paper we present a new method for fast and efficient implementation of multi-operand BCD addition in current FPGA devices. In particular, our proposal maps quite well into the slice structure of the Xilinx Virtex-5/Virtex-6 families and it is highly pipelineable. The synthesis results for a Virtex-6 device indicate that our implementations halve the area and latency of previous proposals, presenting area and delay figures close to those of optimal binary adder trees.La recherche sur l'implantation en matériel de l'arithmétique décimale est actuellement très active, la plupart des travaux portant sur des opérateurs pour les processeurs, en virgule fixe ou flottante. Mais les techniques développées pour un circuit intégré n'aboutissent pas forcément à une implémentation optimale dans un FPGA. Il n'y a que peu d'études ciblant explicitement les FPGA. Cet article s'intéresse dans ce contexte, à l'addition BCD multi-opérande, au cœur de multiplieurs et de multiplieurs-accumulateurs à faible latence. Nous étudions les architectures proposées pour cette opération décimale, et nous observons que, sur FPGA, leur performance (surface et latence) est très inférieure à celle des opérations binaire à précision comparable. Nous présentons donc dans cet article une nouvelle technique d'addition BCD multi-opérandes qui s'avère plus efficace que les propositions précédentes sur les FPGA actuels. Elle s'adapte particulièrement bien à la structure fine des FPGA Xilinx Virtex-5/Virtex-6, et se prête bien au pipeline. Les résultats de synthèse montrent que notre implémentation divise par deux la surface et la latence par rapport aux propositions précédentes, les ramenant à des valeurs comparables à celles des meilleurs additionneurs multi-opérandes binaires

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

Comparison of Pipelined Floating Point Unit with Unpipelined Floating Point Unit

Author: Deepika K. Naga
Latha K. Bhargavee
Pavan K.R.
Triveni J.
Venkateswarao M.
Publication venue: 'Universitas Ahmad Dahlan'
Publication date: 23/04/2015
Field of study

Floating-point numbers are broadly received in numerous applications due their element representation abilities. Floating-point representation has the capacity hold its determination and exactness contrasted with altered point representations. Any Digital Signal Processing (DSP) calculations utilization floating-point math, which obliges a huge number of figuring’s every second to be performed. For such stringent necessities, outline of quick, exact and effective circuits is the objective of each VLSI creator. This paper displays a correlation of pipelined floating-point snake dissention with IEEE 754 organization with an unpipelined viper additionally protests with IEEE 754 arrangement. It depicts the IEEE floating-point standard 754. A pipelined floating point unit in light of IEEE 754 configuration is produced and the outline is contrasted and that of an unpipelined floating point unit and an investigation is defeated speed, range, and force contemplations. It builds the rate as well as is vitality productive. Every one of these changes is at the expense of slight increment in the chip region. The basic methodology and approach used for VHDL (Very Large Scale Integration Hardware Descriptive Language) implementation of the floating-point unit are also described. Detailed synthesis report operated upon Xilinx ISE 11 software and Modelsim is given

Journal of Telematics and Informatics

Optimisations arithmétiques et synthèse de haut niveau

Author: Uguen Yohann
Publication venue: HAL CCSD
Publication date: 13/11/2019
Field of study

High-level synthesis (HLS) tools offer increased productivity regarding FPGA programming.However, due to their relatively young nature, they still lack many arithmetic optimizations.This thesis proposes safe arithmetic optimizations that should always be applied.These optimizations are simple operator specializations, following the C semantic.Other require to a lift the semantic embedded in high-level input program languages, which are inherited from software programming, for an improved accuracy/cost/performance ratio.To demonstrate this claim, the sum-of-product of floating-point numbers is used as a case study. The sum is performed on a fixed-point format, which is tailored to the application, according to the context in which the operator is instantiated.In some cases, there is not enough information about the input data to tailor the fixed-point accumulator.The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range.This thesis explores different strategies for implementing such a large accumulator, including new ones.The use of a 2's complement representation instead of a sign+magnitude is demonstrated to save resources and to reduce the accumulation loop delay.Based on a tapered precision scheme and an exact accumulator, the posit number systems claims to be a candidate to replace the IEEE floating-point format.A throughout analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators.Their cost remains much higher that their floating-point counterparts in terms of resource usage and performance. Finally, this thesis presents a compatibility layer for HLS tools that allows one code to be deployed on multiple tools.This library implements a strongly typed custom size integer type along side a set of optimized custom operators.À cause de la nature relativement jeune des outils de synthèse de haut-niveau (HLS), de nombreuses optimisations arithmétiques n'y sont pas encore implémentées. Cette thèse propose des optimisations arithmétiques se servant du contexte spécifique dans lequel les opérateurs sont instanciés.Certaines optimisations sont de simples spécialisations d'opérateurs, respectant la sémantique du C.D'autres nécéssitent de s'éloigner de cette sémantique pour améliorer le compromis précision/coût/performance.Cette proposition est démontré sur des sommes de produits de nombres flottants.La somme est réalisée dans un format en virgule-fixe défini par son contexte.Quand trop peu d’informations sont disponibles pour définir ce format en virgule-fixe, une stratégie est de générer un accumulateur couvrant l'intégralité du format flottant.Cette thèse explore plusieurs implémentations d'un tel accumulateur.L'utilisation d'une représentation en complément à deux permet de réduire le chemin critique de la boucle d'accumulation, ainsi que la quantité de ressources utilisées. Un format alternatif aux nombres flottants, appelé posit, propose d'utiliser un encodage à précision variable.De plus, ce format est augmenté par un accumulateur exact.Pour évaluer précisément le coût matériel de ce format, cette thèse présente des architectures d'opérateurs posits, implémentés avec le même degré d'optimisation que celui de l'état de l'art des opérateurs flottants.Une analyse détaillée montre que le coût des opérateurs posits est malgré tout bien plus élevé que celui de leurs équivalents flottants.Enfin, cette thèse présente une couche de compatibilité entre outils de HLS, permettant de viser plusieurs outils avec un seul code. Cette bibliothèque implémente un type d'entiers de taille variable, avec de plus une sémantique strictement typée, ainsi qu'un ensemble d'opérateurs ad-hoc optimisés

Series Expansion based Efficient Architectures for Double Precision Floating Point Division

Author: Balakrishnan M
Cheung RCC
Jaiswal MK
Paul K
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

postprin

HKU Scholars Hub

Single-Precision and Double-Precision Merged Floating-Point Multiplication and Addition Units on FPGA

Author: Zhang Hao
Publication venue: 'University of Saskatchewan Library'
Publication date: 11/02/2020
Field of study

Floating-point (FP) operations defined in IEEE 754-2008 Standard for Floating-Point Arithmetic can provide wider dynamic range and higher precision than fixed-point operations. Many scientific computations and multimedia applications adopt FP operations. Among all the FP operations, addition and multiplication are the most frequent operations. In this thesis, the single-precision (SP) and double-precision (DP) merged FP multiplier and FP adder architectures are proposed. The proposed efficient iterative FP multiplier is designed based on the Karatsuba algorithm and implemented with the pipelined architecture. It can accomplish two parallel SP multiplication operations in one iteration with a latency of 6 clock cycles or one DP multiplication operation in two iterations with a latency of 9 clock cycles. Implemented on Xilinx Virtex-5 (xc5vlx155ff1760-3) FPGA device, the proposed multiplier runs at 348 MHz using 6 DSP48E blocks, 1117 LUTs, and 1370 FFs. Compared to previous FPGA based multiple-precision FP multiplier, the proposed designs runs at 4% faster clock frequency with reduction of 33% of DSP blocks, 17% latency for SP multiplication, and 28% latency for DP multiplication. The proposed high performance FP adder is designed based one the two-path FP addition algorithm. With fully pipelined architecture, the proposed adder can accomplish one DP or two parallel SP addition/subtraction operations in 6 clock cycles. The proposed adder architecture is implemented on both Altera and Xilinx 65nm process FPGA devices. The proposed adder can run up to 336 MHz with 1694 FFs, 1420 LUTs on Xilinx Virtex-5 (xc5vlx155ff1760-3) FPGA device. Compared to the combination of one DP and two SP architecture built with Xilinx FP operator, the proposed adder has 11.3% faster clock frequency. On Altera Stratix-III (EP3SL340F1760C2) FPGA device, the maximum clock frequency of the proposed adder can reach 358 MHz and 1686 ALUTs and 1556 registers are occupied. The proposed adder is 11.6% faster than the combination of one DP and two SP architecture built with Altera FP megafunction. For the reference of other researchers, the implementation results of the proposed FP multiplier and FP adder on the latest Xilinx Virtex-7 device and Altera Arria 10 device are also provided

University of Saskatchewan Research Archive

Design-space exploration for the Kulisch accumulator

Author: de Dinechin Florent
Uguen Yohann
Publication venue: HAL CCSD
Publication date: 20/03/2017
Field of study

Floating-point sums and dot products accumulate rounding errors that may render the result very inaccurate. To address this, Kulisch proposed to use an internal accumulator large enough to cover the full exponent range of floating-point. With it, sums and dot products become exact operations. This idea failed to materialize in general purpose processors, as it was considered to slow and/or too expensive in terms of resources. It may however be an interesting option in recon-figurable computing, where a designer may use use smaller, more resource-efficient floating-point formats, knowing that sums and dot products will be exact. Another motivation of this work is that these exact operations, contrary to classical floating point ones, are associative, which enables better compiler optimizations. This work therefore compares, in the context of modern FPGAs, several implementations of the Kulisch accumulator: three proposed by Kulisch, and two novel ones. These architectures are implemented in a VivadoHLS-compliant C++ generator that is fully customiz-able. Comparisons targeting Xilinx's Kintex 7 FPGAs show improvement over Kulisch' proposal in both area and speed. In single precision, compared with a naive use of classical operators , the proposed accumulator runs at similar frequency, consumes 10x more resource in single precision, but reduces the overall latency of a large dot product by 25x while vastly improving accuracy

INRIA a CCSD electronic archive server

Energy-efficient design and implementation of approximate floating-point multiplier

Author: Paparouni Theodora
Παπαρούνη Θεοδώρα
Publication venue
Publication date: 28/01/2020
Field of study

DSpace at NTUA