
    On the design of IEEE compliant floating point units and their quantitative analysis

    This thesis addresses the question of which issues matter most in the design of a high-speed floating-point unit (FPU) that is fully compliant with the IEEE floating-point standard 754-1985 [19]. Several choices must be made when designing an IEEE-compliant FPU, among them: the internal representation of floating-point numbers, the rounding algorithms, the handling of denormal results, the use of the same rounding hardware for different units (e.g. adder, multiplier, divider), and the implementations of the adder, the multiplier and the divider. These choices influence both the cost and the performance of the FPU. Nevertheless, these issues have not been discussed in the open literature to date. This work begins to fill this gap by designing, analyzing and comparing 18 different IEEE-compliant FPU implementations that consider design options regarding: (a) the internal representation of floating-point numbers; (b) the rounding algorithms; (c) sharing of a rounding unit, the implementation of gradual step rounding, or the implementation of dedicated rounding units for each functional unit; (d) the implementation of the floating-point multiplier; and (e) the implementation of the floating-point divider. The presented FPU designs also make use of the following innovations, which were developed in the context of this work: (a) a fast implementation of variable position rounding integrated into an FP multiplier [37]; (b) to the best of our knowledge, the fastest integrated FP addition and rounding algorithm published to date [40]; (c) the fastest FP multiplication rounding algorithm published to date [11, 12]; (d) the fastest linear reciprocal approximation implementation published to date [36, 39]; (e) an efficient integration of single and double precision rounding [9]; (f) a Booth encoded adder-tree with an improved cost formula [30].
All the FPUs designed in this work are fully compliant with the IEEE standard for all implemented operations, support both single and double precision, and handle denormal values and special cases in hardware. Because designing an IEEE-compliant FPU is a complex and error-prone task, all the FPU designs are specified in full detail at gate level, and their correctness (in particular their compliance with the IEEE standard) is proven. The proposed FPU implementations are analyzed and compared with regard to hardware cost, cycle time, and the performance they achieve on traces of the SPECfp92 benchmark suite [17] when integrated into the pipelined RISC processor from [23]. This quantitative analysis [38] demonstrates that the choice of the rounding architecture in the FPU has a larger impact on the performance of the microprocessor than the choice of the FP multiplication or the FP division implementation. By comparison, the impact of the rounding architecture on the cost is relatively small. The rounding architecture that uses dedicated rounding units for each functional unit provides by far the best performance at only a small additional cost, so it appears to be the best choice for floating-point implementations. A fast implementation of this rounding architecture is only made possible by the fast variable position rounding implementation for multipliers from [37], which underlines the importance of this technique.
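
Rounding is the recurring theme of this abstract, so a small software illustration may help. The following is a minimal sketch of IEEE round-to-nearest-even on an unsigned integer significand, where the number of discarded low bits (`drop`) is a parameter, loosely mirroring the idea that the rounding position can vary (for example for denormal results). The function name and the guard/sticky formulation are assumptions chosen for illustration; this is not the gate-level algorithm from the thesis.

```python
def round_nearest_even(sig: int, drop: int) -> int:
    """Discard the `drop` low bits of `sig`, rounding to nearest, ties to even.

    Hypothetical helper for illustration only. The result may be one bit
    wider than expected (carry-out), in which case a real FPU would
    renormalize and increment the exponent.
    """
    if sig < 0 or drop < 0:
        raise ValueError("sig and drop must be non-negative")
    if drop == 0:
        return sig  # nothing is discarded, so nothing to round

    kept = sig >> drop                           # bits that survive
    guard = (sig >> (drop - 1)) & 1              # first discarded bit
    sticky = (sig & ((1 << (drop - 1)) - 1)) != 0  # OR of the remaining discarded bits
    lsb = kept & 1                               # parity of the kept part

    # Round up when above the halfway point, or exactly halfway and odd.
    if guard and (sticky or lsb):
        kept += 1
    return kept


if __name__ == "__main__":
    print(round_nearest_even(0b10110, 2))  # halfway case, odd kept part
    print(round_nearest_even(0b10010, 2))  # halfway case, even kept part
```

Because Python converts integers to `float` with correct rounding, the sketch can be cross-checked against `float()` for significands wider than 53 bits.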

    HW-SW Implementation of a Decoupled FPU for ARM-based Cortex-M1 SoCs in FPGAs

    Industrial single-processor and multiprocessor systems now use hardware floating-point units (FPUs) to accelerate software and improve precision when running computationally complex applications. This paper presents the design of an IEEE-754 compliant FPU intended for use with the ARM Cortex-M1 processor on FPGA SoCs. We design an AMBA-based decoupled FPU to avoid changing the Cortex-M1 ARMv6-M architecture and the ARM compiler, and also so that it can eventually be shared among different processors in our Cortex-M1 MPSoC design. Our HW-SW implementation can be easily integrated to enable hardware-assisted floating-point operations transparently to the software application. This work reports synthesis results for our Cortex-M1 SoC architecture, as well as for our FPU on Altera and Xilinx FPGAs, which exhibit competitive numbers compared to the equivalent Xilinx FPU IP core. Additionally, single and double precision tests have been performed under different scenarios, showing best-case speedups between 8.8x and 53.2x, depending on the FP operation, when compared to FP software emulation libraries

    An 826 MOPS, 210 uW/MHz Unum ALU in 65 nm

    To overcome the limitations of conventional floating-point number formats, an interval arithmetic and variable-width storage format called universal number (unum) has been recently introduced. This paper presents the first (to the best of our knowledge) silicon implementation measurements of an application-specific integrated circuit (ASIC) for unum floating-point arithmetic. The designed chip includes a 128-bit wide unum arithmetic unit to execute additions and subtractions, while also supporting lossless (for intermediate results) and lossy (for external data movements) compression units to exploit the memory usage reduction potential of the unum format. Our chip, fabricated in a 65 nm CMOS process, achieves a maximum clock frequency of 413 MHz at 1.2 V with an average measured power of 210 uW/MHz
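
The headline figures of this entry can be related with a little arithmetic. The sketch below assumes, which the abstract does not state, that the 826 MOPS in the title is reached at the 413 MHz maximum clock and that power scales linearly with frequency. The variable names are illustrative.

```python
# Figures quoted in the title and abstract.
throughput_mops = 826.0      # million operations per second (title)
clock_mhz = 413.0            # maximum clock frequency at 1.2 V
power_uw_per_mhz = 210.0     # average measured power per MHz

# Assumption: the quoted throughput is achieved at the maximum clock.
ops_per_cycle = throughput_mops / clock_mhz

# Assumption: power scales linearly with clock frequency.
power_mw_at_max_clock = power_uw_per_mhz * clock_mhz / 1000.0

# 1 uW/MHz equals 1 pJ per clock cycle, so divide by operations per cycle.
energy_pj_per_op = power_uw_per_mhz / ops_per_cycle

if __name__ == "__main__":
    print(ops_per_cycle, power_mw_at_max_clock, energy_pj_per_op)
```

Under these assumptions the chip would complete two operations per cycle and draw on the order of 87 mW at full speed, or roughly 105 pJ per operation.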

    Optimistic Parallelization of Floating-Point Accumulation

    Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks
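
The non-associativity mentioned here is easy to reproduce in software. The sketch below compares a strictly sequential accumulation with a pairwise (tree) reduction, which is the kind of reassociation a parallel adder network performs. It only illustrates why the two orders can disagree; it does not implement the paper's optimistic error-propagation scheme, and the function names are made up for this example.

```python
def sequential_sum(values):
    """Accumulate left to right, as a single serial adder would."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Pairwise reduction: sum each half independently, then combine."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])


if __name__ == "__main__":
    # Grouping changes the rounded result even for three operands.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    # Small terms are absorbed one at a time in the sequential order,
    # but survive when they are first added to each other.
    data = [1e16, 1.0, 1.0, 1.0]
    print(sequential_sum(data), tree_sum(data))
```

For most inputs the two orders agree, which is the "mostly associative" behaviour the abstract relies on; the data above is chosen so that they do not.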

    Pipelining Of Double Precision Floating Point Division And Square Root Operations On Field-programmable Gate Arrays

    Many space applications, such as vision-based systems, synthetic aperture radar, and radar altimetry, rely increasingly on high data rate DSP algorithms. These algorithms use double precision floating point arithmetic operations. While most DSP applications can be executed on DSP processors, the numerical requirements of these new space applications far surpass the numerical capabilities of many current DSP processors. Since the tradition in DSP processing has been to use fixed point number representation, only recently have DSP processors begun to incorporate floating point arithmetic units, and most of these units handle only single precision floating point addition/subtraction, multiplication, and occasionally division. While DSP processors are slowly evolving to meet the numerical requirements of newer space applications, FPGA densities have rapidly increased to parallel and surpass even the gate densities of many DSP processors and commodity CPUs. This makes them attractive platforms to implement compute-intensive DSP computations. Even in the presence of this clear advantage on the side of FPGAs, few attempts have been made to examine how wide precision floating point arithmetic, particularly division and square root operations, can perform on FPGAs to support these compute-intensive DSP applications. In this context, this thesis presents the sequential and pipelined designs of IEEE-754 compliant double precision floating point division and square root operations based on low radix digit recurrence algorithms. FPGA implementations of these algorithms have the advantage of being easily testable. In particular, the pipelined designs are synthesized based on careful partial and full unrolling of the iterations in the digit recurrence algorithms.
Overall, the implementations of the sequential and pipelined designs are common-denominator implementations which do not use any performance-enhancing embedded components such as multipliers and block memory. As these implementations exploit exclusively the fine-grain reconfigurable resources of Virtex FPGAs, they are easily portable to other FPGAs with similar reconfigurable fabrics without any major modifications. The pipelined designs of these two operations are evaluated in terms of area, throughput, and dynamic power consumption as a function of pipeline depth. Pipelining experiments reveal that the area overhead tends to remain constant regardless of the degree of pipelining applied to the design, while the throughput increases with pipeline depth. In addition, these experiments reveal that pipelining reduces power considerably in shallow pipelines, whereas pipelining these designs further does not necessarily lead to significant power reduction. When partitioned into deeper pipelines, these designs can reach throughputs close to the 100 MFLOPS mark while consuming a modest 1% to 8% of the reconfigurable fabric within a Virtex-II XC2VX000 (e.g., XC2V1000 or XC2V6000) FPGA
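
To make "low radix digit recurrence" concrete, here is a minimal software sketch of radix-2 restoring recurrences for division and square root on unsigned integers, producing one result bit per iteration. It is an assumption-laden illustration (restoring variant, integer operands, invented function names), not the thesis's hardware design, which operates on IEEE-754 significands and also handles exponents, rounding and special cases.

```python
import math


def restoring_divide(x: int, d: int, frac_bits: int):
    """Radix-2 restoring division for 0 <= x < d.

    Returns (q, r) such that x * 2**frac_bits == q * d + r and 0 <= r < d.
    Each loop iteration yields one quotient bit; unrolling the loop is the
    software analogue of adding pipeline stages.
    """
    if d <= 0 or not (0 <= x < d):
        raise ValueError("requires 0 <= x < d")
    q, r = 0, x
    for _ in range(frac_bits):
        r <<= 1              # shift the partial remainder
        q <<= 1
        if r >= d:           # trial subtraction succeeds: quotient bit is 1
            r -= d
            q |= 1
    return q, r


def digit_recurrence_sqrt(x: int, bits: int):
    """Radix-2 restoring square root for 0 <= x < 4**bits.

    Returns (root, rem) with root == floor(sqrt(x)) and root**2 + rem == x.
    Each iteration consumes two bits of x and yields one bit of the root.
    """
    if x < 0 or x >= 4 ** bits:
        raise ValueError("requires 0 <= x < 4**bits")
    root, rem = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 2) | ((x >> (2 * i)) & 3)  # bring down the next bit pair
        trial = (root << 2) | 1                  # value to subtract if the next bit is 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    return root, rem


if __name__ == "__main__":
    print(restoring_divide(1, 3, 8))
    print(digit_recurrence_sqrt(10, 2), math.isqrt(10))
```

Both loops use only shifts, comparisons and subtractions, which is consistent with the abstract's point that such designs need no embedded multipliers or block memory.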