48 research outputs found

    Low latency digit-recurrence reciprocal and square-root reciprocal algorithm and architecture

    Get PDF
    The reciprocal and square root reciprocal operations are important in several applications such as computer graphics and scientific computation. For the two operations, an algorithm that combines a digit-by-digit module and one iteration of the Newton-Raphson approximation is used. The latter is implemented by a digit-recurrence, which uses the digits produced by the digit-by-digit part. In this way, both parts execute in an overlapped manner, so that the total number of cycles is about half of the number that would be required by the digit-by-digit part alone. Since the approximation does not produce correct rounding in a few cases, for applications where exact rounding is required, the result is only computed by the digit-by-digit module. Radix-4 implementations for combined unit are described. The result of the evaluation shows that the cycle time is the same as that of the digit-by-digit unit and that, as a consequence, the execution time is almost halved. Because of the approximation part, the area almost doubles of the digit-by-digit area. In this Dissertation, an implementation algorithm and architecture for low latency digit recurrence reciprocal and squareroot reciprocal has been presented. The implementation is based on Radix-4, considering execution time and system complexity. This project aims to provide modules written in the Very High Speed Integrated Circuit Hardware Description Language (VHDL) that can be used to model low latency digit recurrence reciprocal and square root reciprocal architecture

    Algorithms and architectures for decimal transcendental function computation

    Get PDF
    Nowadays, there are many commercial demands for decimal floating-point (DFP) arithmetic operations such as financial analysis, tax calculation, currency conversion, Internet based applications, and e-commerce. This trend gives rise to further development on DFP arithmetic units which can perform accurate computations with exact decimal operands. Due to the significance of DFP arithmetic, the IEEE 754-2008 standard for floating-point arithmetic includes it in its specifications. The basic decimal arithmetic unit, such as decimal adder, subtracter, multiplier, divider or square-root unit, as a main part of a decimal microprocessor, is attracting more and more researchers' attentions. Recently, the decimal-encoded formats and DFP arithmetic units have been implemented in IBM's system z900, POWER6, and z10 microprocessors. Increasing chip densities and transistor count provide more room for designers to add more essential functions on application domains into upcoming microprocessors. Decimal transcendental functions, such as DFP logarithm, antilogarithm, exponential, reciprocal and trigonometric, etc, as useful arithmetic operations in many areas of science and engineering, has been specified as the recommended arithmetic in the IEEE 754-2008 standard. Thus, virtually all the computing systems that are compliant with the IEEE 754-2008 standard could include a DFP mathematical library providing transcendental function computation. Based on the development of basic decimal arithmetic units, more complex DFP transcendental arithmetic will be the next building blocks in microprocessors. In this dissertation, we researched and developed several new decimal algorithms and architectures for the DFP transcendental function computation. These designs are composed of several different methods: 1) the decimal transcendental function computation based on the table-based first-order polynomial approximation method; 2) DFP logarithmic and antilogarithmic converters based on the decimal digit-recurrence algorithm with selection by rounding; 3) a decimal reciprocal unit using the efficient table look-up based on Newton-Raphson iterations; and 4) a first radix-100 division unit based on the non-restoring algorithm with pre-scaling method. Most decimal algorithms and architectures for the DFP transcendental function computation developed in this dissertation have been the first attempt to analyze and implement the DFP transcendental arithmetic in order to achieve faithful results of DFP operands, specified in IEEE 754-2008. To help researchers evaluate the hardware performance of DFP transcendental arithmetic units, the proposed architectures based on the different methods are modeled, verified and synthesized using FPGAs or with CMOS standard cells libraries in ASIC. Some of implementation results are compared with those of the binary radix-16 logarithmic and exponential converters; recent developed high performance decimal CORDIC based architecture; and Intel's DFP transcendental function computation software library. The comparison results show that the proposed architectures have significant speed-up in contrast to the above designs in terms of the latency. The algorithms and architectures developed in this dissertation provide a useful starting point for future hardware-oriented DFP transcendental function computation researches

    High sample-rate Givens rotations for recursive least squares

    Get PDF
    The design of an application-specific integrated circuit of a parallel array processor is considered for recursive least squares by QR decomposition using Givens rotations, applicable in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation, which, for this recursive algorithm, means that the time to perform arithmetic operations is critical. The algorithm, architecture and arithmetic are considered in a single integrated design procedure to achieve optimum results. A realisation approach using standard arithmetic operators, add, multiply and divide is adopted. The design of high-throughput operators with low delay is addressed for fixed- and floating-point number formats, and the application of redundant arithmetic considered. New redundant multiplier architectures are presented enabling reductions in area of up to 25%, whilst maintaining low delay. A technique is presented enabling the use of a conventional tree multiplier in recursive applications, allowing savings in area and delay. Two new divider architectures are presented showing benefits compared with the radix-2 modified SRT algorithm. Givens rotation algorithms are examined to determine their suitability for VLSI implementation. A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed enabling the sample-rate to be increased by a factor of approximately 6 and offering area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of 136 MHz could be achieved using a standard cell approach and O.35pm CMOS technology. The enhanced SGR algorithm has been compared with a CORDIC approach and shown to benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation on a parallel array of general purpose (GP) DSP chips, it is estimated that a single application specific chip could offer up to 1,500 times the computation obtained from a single OP DSP chip

    An Architecture for Improving Variable Radix Real and Complex Division Using Recurrence Division

    Get PDF
    International audienceThis paper shows the details of an implementation of variable radix floating-point complex division based on previous implementations of the algorithm. This implementation takes advantage of the easier prescaling offered by low-radix division and recodes it as necessary for higher radix iterations throughout the design. This, along with proper use of redundant digit sets, allows us to significantly altar performance characteristics relative to exclusively high-radix division implementations. Comparisons to existing architectures are shown, as well as common implementation optimizations for future iterations. Results are given in cmos32soi 32nm MTCMOS technology using ARMbased standard-cells and commercial EDA toolsets

    Fast decimal floating-point division

    Get PDF
    A new implementation for decimal floating-point (DFP) division is introduced. The algorithm is based on high-radix SRT division The SRT division algorithm is named after D. Sweeney, J. E. Robertson, and T. D. Tocher. with the recurrence in a new decimal signed-digit format. Quotient digits are selected using comparison multiples, where the magnitude of the quotient digit is calculated by comparing the truncated partial remainder with limited precision multiples of the divisor. The sign is determined concurrently by investigating the polarity of the truncated partial remainder. A timing evaluation using a logic synthesis shows a significant decrease in the division execution time in contrast with one of the fastest DFP dividers reported in the open literatureHooman Nikmehr, Braden Phillips and Cheng-Chew Li

    Comparison of logarithmic and floating-point number systems implemented on Xilinx Virtex-II field-programmable gate arrays

    Get PDF
    The aim of this thesis is to compare the implementation of parameterisable LNS (logarithmic number system) and floating-point high dynamic range number systems on FPGA. The Virtex/Virtex-II range of FPGAs from Xilinx, which are the most popular FPGA technology, are used to implement the designs. The study focuses on using the low level primitives of the technology in an efficient way and so initially the design issues in implementing fixed-point operators are considered. The four basic operations of addition, multiplication, division and square root are considered. Carry- free adders, ripple-carry adders, parallel multipliers and digit recurrence division and square root are discussed. The floating-point operators use the word format and exceptions as described by the IEEE std-754. A dual-path adder implementation is described in detail, as are floating-point multiplier, divider and square root components. Results and comparisons with other works are given. The efficient implementation of function evaluation methods is considered next. An overview of current FPGA methods is given and a new piecewise polynomial implementation using the Taylor series is presented and compared with other designs in the literature. In the next section the LNS word format, accuracy and exceptions are described and two new LNS addition/subtraction function approximations are described. The algorithms for performing multiplication, division and powering in the LNS domain are also described and are compared with other designs in the open literature. Parameterisable conversion algorithms to convert to/from the fixed-point domain from/to the LNS and floating-point domain are described and implementation results given. In the next chapter MATLAB bit-true software models are given that have the exact functionality as the hardware models. The interfaces of the models are given and a serial communication system to perform low speed system tests is described. A comparison of the LNS and floating-point number systems in terms of area and delay is given. Different functions implemented in LNS and floating-point arithmetic are also compared and conclusions are drawn. The results show that when the LNS is implemented with a 6-bit or less characteristic it is superior to floating-point. However, for larger characteristic lengths the floating-point system is more efficient due to the delay and exponential area increase of the LNS addition operator. The LNS is beneficial for larger characteristics than 6-bits only for specialist applications that require a high portion of division, multiplication, square root, powering operations and few additions

    Decimal Floating-point Fused Multiply Add with Redundant Number Systems

    Get PDF
    The IEEE standard of decimal floating-point arithmetic was officially released in 2008. The new decimal floating-point (DFP) format and arithmetic can be applied to remedy the conversion error caused by representing decimal floating-point numbers in binary floating-point format and to improve the computing performance of the decimal processing in commercial and financial applications. Nowadays, many architectures and algorithms of individual arithmetic functions for decimal floating-point numbers are proposed and investigated (e.g., addition, multiplication, division, and square root). However, because of the less efficiency of representing decimal number in binary devices, the area consumption and performance of the DFP arithmetic units are not comparable with the binary counterparts. IBM proposed a binary fused multiply-add (FMA) function in the POWER series of processors in order to improve the performance of floating-point computations and to reduce the complexity of hardware design in reduced instruction set computing (RISC) systems. Such an instruction also has been approved to be suitable for efficiently implementing not only stand-alone addition and multiplication, but also division, square root, and other transcendental functions. Additionally, unconventional number systems including digit sets and encodings have displayed advantages on performance and area efficiency in many applications of computer arithmetic. In this research, by analyzing the typical binary floating-point FMA designs and the design strategy of unconventional number systems, ``a high performance decimal floating-point fused multiply-add (DFMA) with redundant internal encodings" was proposed. First, the fixed-point components inside the DFMA (i.e., addition and multiplication) were studied and investigated as the basis of the FMA architecture. The specific number systems were also applied to improve the basic decimal fixed-point arithmetic. The superiority of redundant number systems in stand-alone decimal fixed-point addition and multiplication has been proved by the synthesis results. Afterwards, a new DFMA architecture which exploits the specific redundant internal operands was proposed. Overall, the specific number system improved, not only the efficiency of the fixed-point addition and multiplication inside the FMA, but also the architecture and algorithms to build up the FMA itself. The functional division, square root, reciprocal, reciprocal square root, and many other functions, which exploit the Newton's or other similar methods, can benefit from the proposed DFMA architecture. With few necessary on-chip memory devices (e.g., Look-up tables) or even only software routines, these functions can be implemented on the basis of the hardwired FMA function. Therefore, the proposed DFMA can be implemented on chip solely as a key component to reduce the hardware cost. Additionally, our research on the decimal arithmetic with unconventional number systems expands the way of performing other high-performance decimal arithmetic (e.g., stand-alone division and square root) upon the basic binary devices (i.e., AND gate, OR gate, and binary full adder). The proposed techniques are also expected to be helpful to other non-binary based applications

    Improving Goldschmidt Division, Square Root and Square Root Reciprocal

    Get PDF
    The aim of this paper is to accelerate division, square root and square root reciprocal computations, when Goldschmidt method is used on a pipelined multiplier. This is done by replacing the last iteration by the addition of a correcting term that can be looked up during the early iterations. We describe several variants of the Goldschmidt algorithm assuming 4-cycle pipelined multiplier and discuss obtained number of cycles and error achieved. Extensions to other than 4-cycle multipliers are given.Le but de cet article est l'accélération de la division, et du calcul de racines carrées et d'inverses de racines carrées lorsque la méthode de Goldschmidt est utilisée sur un multiplieur pipe-line. Nous faisons ceci en remplaçant la dernière itération par l'addition d'un terme de correction qui peut être déduit d'une lecture de table effectuée lors des premières itérations. Nous décrivons plusieurs variantes de l'algorithme obtenu en supposant un multiplieur à 4 étages de pipe-line, et donnons pour chaque variante l'erreur obtenue et le nombre de cycles de calcul. Des extensions de ce travail à des multiplieurs dont le nombre d'étages est différent sont présentées

    IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider

    Get PDF
    Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform all of the mathematical and logic operations of computer processors. Because they are used so prominently, they fall in the critical path of the central processing unit - often becoming the bottleneck, or limiting factor for performance. As such, the design of a high-speed ALU and FPU is vital to creating a processor capable of performing up to the demanding standards of today\u27s computer users. In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the reduced instruction set computer architecture. The ALU performs the four basic mathematical operations - addition, subtraction, multiplication and division - in both unsigned and two\u27s complement format, basic logic operations and shifting. The division algorithm is a novel approach, using a comparison multiples based SRT divider to create a variable latency integer divider. The floating-point unit performs the double-precision floating-point operations add, subtract, multiply and divide, in accordance with the IEEE 754 standard for number representation and rounding. The ALU and FPU were implemented in VHDL, simulated in ModelSim, and constrained and synthesized using Synopsys Design Compiler (2006.06). They were synthesized using TSMC 0.1 3nm CMOS technology. The timing, power and area synthesis results were recorded, and, where applicable, compared to those of the corresponding DesignWare components.The ALU synthesis reported an area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU synthesis reported an area 84,440 gates, a delay of 2.82 ns and an operating frequency of 355 MHz. It has a maximum dynamic power of 153.9 mW
    corecore