79 research outputs found

    Fast decimal floating-point division

    Get PDF
    A new implementation for decimal floating-point (DFP) division is introduced. The algorithm is based on high-radix SRT division The SRT division algorithm is named after D. Sweeney, J. E. Robertson, and T. D. Tocher. with the recurrence in a new decimal signed-digit format. Quotient digits are selected using comparison multiples, where the magnitude of the quotient digit is calculated by comparing the truncated partial remainder with limited precision multiples of the divisor. The sign is determined concurrently by investigating the polarity of the truncated partial remainder. A timing evaluation using a logic synthesis shows a significant decrease in the division execution time in contrast with one of the fastest DFP dividers reported in the open literatureHooman Nikmehr, Braden Phillips and Cheng-Chew Li

    Dedicated Hardware for Complex Mathematical Operations

    Get PDF
    New hardware FPGA implementations for the efficient computations of division, natural logarithm and exponential function are proposed. The proposed implementations use generic floating-point adder and multiplier with small additional resources that are shared to compute more frequently used multiply and accumulate operations. Hardware sharing improved the resource utilization. The time of the computation has been reduced to only 6 clock cycles when the natural logarithm and exponential function are calculated. The division is calculated in 5 clock cycles. They are designed as technology independent high throughput computing cores with minimized memory requirements which can be used in higher numbers to significantly increased calculation speed in spectral processing. A new universal arithmetic floating-point unit is also proposed

    Algorithms and architectures for decimal transcendental function computation

    Get PDF
    Nowadays, there are many commercial demands for decimal floating-point (DFP) arithmetic operations such as financial analysis, tax calculation, currency conversion, Internet based applications, and e-commerce. This trend gives rise to further development on DFP arithmetic units which can perform accurate computations with exact decimal operands. Due to the significance of DFP arithmetic, the IEEE 754-2008 standard for floating-point arithmetic includes it in its specifications. The basic decimal arithmetic unit, such as decimal adder, subtracter, multiplier, divider or square-root unit, as a main part of a decimal microprocessor, is attracting more and more researchers' attentions. Recently, the decimal-encoded formats and DFP arithmetic units have been implemented in IBM's system z900, POWER6, and z10 microprocessors. Increasing chip densities and transistor count provide more room for designers to add more essential functions on application domains into upcoming microprocessors. Decimal transcendental functions, such as DFP logarithm, antilogarithm, exponential, reciprocal and trigonometric, etc, as useful arithmetic operations in many areas of science and engineering, has been specified as the recommended arithmetic in the IEEE 754-2008 standard. Thus, virtually all the computing systems that are compliant with the IEEE 754-2008 standard could include a DFP mathematical library providing transcendental function computation. Based on the development of basic decimal arithmetic units, more complex DFP transcendental arithmetic will be the next building blocks in microprocessors. In this dissertation, we researched and developed several new decimal algorithms and architectures for the DFP transcendental function computation. These designs are composed of several different methods: 1) the decimal transcendental function computation based on the table-based first-order polynomial approximation method; 2) DFP logarithmic and antilogarithmic converters based on the decimal digit-recurrence algorithm with selection by rounding; 3) a decimal reciprocal unit using the efficient table look-up based on Newton-Raphson iterations; and 4) a first radix-100 division unit based on the non-restoring algorithm with pre-scaling method. Most decimal algorithms and architectures for the DFP transcendental function computation developed in this dissertation have been the first attempt to analyze and implement the DFP transcendental arithmetic in order to achieve faithful results of DFP operands, specified in IEEE 754-2008. To help researchers evaluate the hardware performance of DFP transcendental arithmetic units, the proposed architectures based on the different methods are modeled, verified and synthesized using FPGAs or with CMOS standard cells libraries in ASIC. Some of implementation results are compared with those of the binary radix-16 logarithmic and exponential converters; recent developed high performance decimal CORDIC based architecture; and Intel's DFP transcendental function computation software library. The comparison results show that the proposed architectures have significant speed-up in contrast to the above designs in terms of the latency. The algorithms and architectures developed in this dissertation provide a useful starting point for future hardware-oriented DFP transcendental function computation researches

    RISC-V Core Instruction Extension Sets M and F

    Get PDF
    This thesis project presents the hardware design of the components capable of implementing a 5-stages core RV32I, RV32IM with integer multiplication and division expansion, and RV32IMF with partial single-precision floating-point support. These have been developed using Verilog HDL and based on the RISC-V ISA. Furthermore, these designs have been verified and synthesised on "bare-metal" using the FPGA platform from the DE0 development board. In addition, a custom variety of division modules have been produced to offer performance diversity on frequency of operation, resource allocation and number of clock cycles per division operations. The selection of these modules provides implementation options that allow to personalize the product to the customer needs

    HIGH-SPEED CO-PROCESSORS BASED ON REDUNDANT NUMBER SYSTEMS

    Get PDF
    There is a growing demand for high-speed arithmetic co-processors for use in applications with computationally intensive tasks. For instance, Fast Fourier Transform (FFT) co-processors are used in real-time multimedia services and financial applications use decimal co-processors to perform large amounts of decimal computations. Using redundant number systems to eliminate word-wide carry propagation within interim operations is a well-known technique to increase the speed of arithmetic hardware units. Redundant number systems are mostly useful in applications where many consecutive arithmetic operations are performed prior to the final result, making it advantageous for arithmetic co-processors. This thesis discusses the implementation of two popular arithmetic co-processors based on redundant number systems: namely, the binary FFT co-processor and the decimal arithmetic co-processor. FFT co-processors consist of several consecutive multipliers and adders over complex numbers. FFT architectures are implemented based on fixed-point and floating-point arithmetic. The main advantage of floating-point over fixed-point arithmetic is the wide dynamic range it introduces. Moreover, it avoids numerical issues such as scaling and overflow/underflow concerns at the expense of higher cost. Furthermore, floating-point implementation allows for an FFT co-processor to collaborate with general purpose processors. This offloads computationally intensive tasks from the primary processor. The first part of this thesis, which is devoted to FFT co-processors, proposes a new FFT architecture that uses a new Binary-Signed Digit (BSD) carry-limited adder, a new floating-point BSD multiplier and a new floating-point BSD three-operand adder. Finally, a new unit labeled as Fused-Dot-Product-Add (FDPA) is designed to compute AB+CD+E over floating-point BSD operands. The second part of the thesis discusses decimal arithmetic operations implemented in hardware using redundant number systems. These operations are popularly used in decimal floating-point co-processors. A new signed-digit decimal adder is proposed along with a sequential decimal multiplier that uses redundant number systems to increase the operational frequency of the multiplier. New redundant decimal division and square-root units are also proposed. The architectures proposed in this thesis were all implemented using Hardware-Description-Language (Verilog) and synthesized using Synopsys Design Compiler. The evaluation results prove the speed improvement of the new arithmetic units over previous pertinent works. Consequently, the FFT and decimal co-processors designed in this thesis work with at least 10% higher speed than that of previous works. These architectures are meant to fulfill the demand for the high-speed co-processors required in various applications such as multimedia services and financial computations

    Design and Implementation of a Radix-100 Division Unit

    Get PDF
    In this thesis, a novel radix-100 divider is designed and implemented. The proposed divider is 3% faster then the current decimal dividers

    Design of efficient reversible floating-point arithmetic unit on field programmable gate array platform and its performance analysis

    Get PDF
    The reversible logic gates are used to improve the power dissipation in modern computer applications. The floating-point numbers with reversible features are added advantage to performing complex algorithms with high-performance computations. This manuscript implements an efficient reversible floating-point arithmetic (RFPA) unit, and its performance metrics are realized in detail. The RFP adder/subtractor (A/S), RFP multiplier, and RFP divider units are designed as a part of the RFP arithmetic unit. The RFPA unit is designed by considering basic reversible gates. The mantissa part of the RFP multiplier is created using a 24x24 Wallace tree multiplier. In contrast, the reciprocal unit of the RFP divider is designed using Newton Raphson’s method. The RFPA unit and its submodules are executed in parallel by utilizing one clock cycle individually. The RFPA unit and its submodules are synthesized separately on the Vivado IDE environment and obtained the implementation results on Artix-7 field programmable gate array (FPGA). The RFPA unit utilizes only 18.44% slice look-up tables (LUTs) by consuming the 0.891 W total power on Artix-7 FPGA. The RFPA unit sub-models are compared with existing approaches with better performance metrics and chip resource utilization improvements

    Handbook of Floating-Point Arithmetic

    Full text link

    IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider

    Get PDF
    Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform all of the mathematical and logic operations of computer processors. Because they are used so prominently, they fall in the critical path of the central processing unit - often becoming the bottleneck, or limiting factor for performance. As such, the design of a high-speed ALU and FPU is vital to creating a processor capable of performing up to the demanding standards of today\u27s computer users. In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the reduced instruction set computer architecture. The ALU performs the four basic mathematical operations - addition, subtraction, multiplication and division - in both unsigned and two\u27s complement format, basic logic operations and shifting. The division algorithm is a novel approach, using a comparison multiples based SRT divider to create a variable latency integer divider. The floating-point unit performs the double-precision floating-point operations add, subtract, multiply and divide, in accordance with the IEEE 754 standard for number representation and rounding. The ALU and FPU were implemented in VHDL, simulated in ModelSim, and constrained and synthesized using Synopsys Design Compiler (2006.06). They were synthesized using TSMC 0.1 3nm CMOS technology. The timing, power and area synthesis results were recorded, and, where applicable, compared to those of the corresponding DesignWare components.The ALU synthesis reported an area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU synthesis reported an area 84,440 gates, a delay of 2.82 ns and an operating frequency of 355 MHz. It has a maximum dynamic power of 153.9 mW
    • …
    corecore