Abstract: Hardware support for floating-point (FP) arithmetic is a mandatory feature of modern microprocessor design. Although division and square root are relatively infrequent operations in traditional general-purpose applications, they are indispensable and becoming increasingly important in many modern applications. Overall performance can therefore be greatly affected by the algorithms and implementations used in FP-div and FP-sqrt units. In this paper, a fused floating-point multiply/divide/square root unit based on a Taylor-series expansion algorithm is proposed. We extend an existing fused multiply/divide unit to incorporate the square root function with little area and latency overhead, since Taylor's theorem enables us to compute approximations for many well-known functions with very similar forms. The proposed arithmetic unit exhibits a reasonably good area-performance balance.
I. INTRODUCTION
Due to constant advances in VLSI technology and the prevalence of business, technical, and recreational applications that use floating-point operations, floating-point computational logic has long been an essential component of high-performance computer systems as well as embedded systems and mobile applications. The performance of many modern applications which have a high frequency of floating-point operations is often limited by the speed of the floating-point hardware. Therefore, a high-performance FPU is an essential component of these systems. Over the past years, leading architectures have incorporated several generations of FPUs. However, while addition and multiplication implementations have become increasingly efficient, support for division and other elementary functions such as square root has remained uneven [1] .
Division has long been considered a minor, bothersome member of the floating-point family. Hardware designers frequently perceive divisions as infrequent, low-priority operations, and they allocate design effort and chip resources accordingly. As shown in Table I, addition and multiplication require from two to five machine cycles, while division latencies range from nine to sixty cycles. The variation is even greater for square root. Although division and square root are relatively infrequent operations in traditional general-purpose applications, they are indispensable and becoming increasingly important, particularly in modern applications such as CAD tools and 3D graphics rendering. Furthermore, because of the latency gap between addition/multiplication and division/square root, the latter operations increasingly become performance bottlenecks.
Therefore, poor implementations of floating-point division and square root can result in severe performance degradation.
This research was supported by DARPA.
The remainder of this paper is organized as follows. Section II briefly describes existing algorithms for floating-point division and square root. Section III describes the proposed fused floating-point multiply/divide/square root unit in detail. Section IV presents comparison results, and Section V gives a brief summary and conclusion.
II. DIVISION / SQUARE ROOT ALGORITHMS
The long latency of division and square root operations is mainly due to the algorithms used for these operations. First-order Newton-Raphson algorithms and binomial expansion (Goldschmidt's) algorithms are commonly used for division and square root implementations in high-performance systems. These algorithms perform better than subtractive algorithms, and an existing floating-point multiplier can be shared among their iterative multiply operations to reduce area. Even so, a single operation still has a long latency, and a subsequent operation cannot start until the previous one finishes, because the shared multiplier is occupied by the algorithm's successive multiply operations.
Liddicoat and Flynn [2] proposed a multiplicative division algorithm based on Taylor-series expansion, as shown in Fig. 2. This algorithm achieves fast computation by using parallel powering units, such as squaring and cubing units, which compute the higher-order terms significantly faster than traditional multipliers with a relatively small hardware overhead. The squaring unit is illustrated in Fig. 3, and a cubing unit can be designed in a similar fashion. Further area and latency optimization can be achieved by truncating the least significant bits of the partial product terms of these units while still satisfying the required precision of the floating-point operation. Specifically, 69.6% of the partial product terms from the squaring unit and 97.6% of the partial product terms from the cubing unit were truncated. As a result, the proposed squaring unit achieves an area reduction of 66.8%, and the proposed cubing unit achieves an area reduction of 89.9%, as compared with a traditional multiplier.
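The Taylor-series division described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the datapath of [2]: the function names, the midpoint-based seed table, and the use of double-precision arithmetic are all hypothetical choices made for the example. With a seed X0 close to 1/b and e = 1 - b*X0, the quotient is approximated as a*X0*(1 + e + e^2 + e^3); in hardware the e^2 and e^3 terms would come from the parallel squaring and cubing units.

```python
def seed_reciprocal(b: float, bits: int = 8) -> float:
    """Hypothetical stand-in for the seed lookup table.

    The table is indexed by the top `bits` fraction bits of b (1 <= b < 2)
    and returns the reciprocal of the midpoint of the indexed interval.
    """
    if not 1.0 <= b < 2.0:
        raise ValueError("b must be a normalized significand in [1, 2)")
    index = int((b - 1.0) * (1 << bits))
    midpoint = 1.0 + (index + 0.5) / (1 << bits)
    return 1.0 / midpoint


def taylor_divide(a: float, b: float, order: int = 3) -> float:
    """Approximate a / b with a truncated Taylor series around the seed."""
    x0 = seed_reciprocal(b)
    e = 1.0 - b * x0  # seed error, at most about 2**-9 for an 8-bit table
    # 1 + e + e^2 + ... + e^order; the powers of e are independent of each
    # other, which is what lets powering units evaluate them in parallel.
    series = sum(e ** k for k in range(order + 1))
    return (a * x0) * series


if __name__ == "__main__":
    for a, b in [(1.0, 1.0), (1.5, 1.25), (1.999, 1.001)]:
        approx = taylor_divide(a, b)
        print(a, b, approx, abs(approx - a / b) / (a / b))
```

Because the truncated series equals (1 - e^4)/(1 - e), the relative error of the sketch is about e^4, which for an 8-bit seed is far below the 2^-24 resolution of a single-precision significand. Exact IEEE-754 rounding is not modeled here.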
III. PROPOSED FP-MUL/DIV/SQRT FUSED UNIT
The Taylor-series expansion algorithm with powering units requires three major multiply operations to produce a quotient with 0.5 ulp (unit in the last place) error, as shown in Fig. 2. One additional multiply operation is required for exact rounding to generate results compliant with the IEEE-754 floating-point standard. Even though the Taylor-series expansion algorithm with powering units exhibits the highest performance among multiplicative algorithms, it consumes a larger area because the architecture consists of four multipliers, which makes it unsuitable for area-critical applications. In earlier work, we presented a fused floating-point multiply-divide unit based on Taylor-series expansion with powering units, in which all multiply operations are executed by one multiplier to maximize area efficiency, while high performance is achieved by using a pipelined architecture [5] [6]. Because the 2-stage pipelined multiplier is shared among the multiply operations in the algorithm, the latency becomes longer (12 clock cycles) than that of the direct implementation of the original algorithm (8 clock cycles). However, through careful pipeline scheduling, we were able to achieve a moderately high throughput (one completion every 5 clock cycles) for consecutive divide instructions and an area 1.6 times smaller. The area difference between the proposed arithmetic unit and the direct implementation of the original algorithm is mainly due to the area occupied by the multipliers.
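The latency and throughput figures quoted above can be turned into a rough cycle count for a run of back-to-back divides. The sketch below is only arithmetic on those two numbers; the function name is hypothetical, and it assumes the 5-cycle completion interval holds from the first divide onward, which the text does not state explicitly.

```python
def divide_stream_cycles(n: int, latency: int = 12, interval: int = 5) -> int:
    """Cycles to complete n consecutive divides on the shared-multiplier unit.

    Assumes the first result appears after `latency` cycles and each later
    result follows `interval` cycles after the previous one.
    """
    if n <= 0:
        return 0
    return latency + (n - 1) * interval


if __name__ == "__main__":
    for n in (1, 4, 16):
        pipelined = divide_stream_cycles(n)
        serialized = n * 12  # if each divide had to wait for the previous one
        print(n, pipelined, serialized)
```

Under these assumptions, four consecutive divides finish in 12 + 3*5 = 27 cycles, compared with 48 cycles if each divide occupied the unit for its full 12-cycle latency.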
Approximations for many well-known functions can be computed by using Taylor's theorem. Therefore, we can extend the existing fused Mul/Div unit to other elementary functions, such as square root, which is a common operation in many modern multimedia applications such as graphics engines. The Taylor-series approximations of such elementary functions have very similar forms; the only difference is the coefficient of each term in the polynomial, as shown in equations (1) and (3). The polynomial coefficients for Taylor-series approximations are typically very low in complexity, and the coefficient multiplications can often be computed using multiplexers, since a simple 1-bit right shift provides a multiple of 1/2. Therefore, the existing fused unit can be extended to incorporate a square root function with little area overhead: the only additional hardware components are two lookup tables for the initial seed value of the square root (for odd and even operands), which are also needed by other algorithms, and three multiplexers. The latency and throughput remain the same, since adding multiplexers to the existing datapath incurs negligible delay. A block diagram of the proposed Mul/Div/Sqrt fused arithmetic unit is shown in Fig. 4, and the steps required for a division and a square root operation are described in Table 2.
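To illustrate how the square root expansion reuses the same polynomial form with different coefficients, the following sketch uses the standard series 1/sqrt(1 - e) = 1 + (1/2)e + (3/8)e^2 + (5/16)e^3 + ..., with e = 1 - m*Y0^2 and a seed Y0 close to 1/sqrt(m). It is a hedged software model, not the proposed hardware: the function names and the midpoint-based seed tables are hypothetical, "odd and even" is assumed to refer to the parity of the exponent, and writing 3/8 and 5/16 as sums of shifted terms is one possible decomposition chosen for the example.

```python
import math


def seed_rsqrt(m: float, bits: int = 8) -> float:
    """Hypothetical seed for 1/sqrt(m), with one table per exponent parity.

    Even exponents use m in [1, 2); odd exponents use m in [2, 4).
    Each table entry is 1/sqrt of the midpoint of the indexed interval.
    """
    if 1.0 <= m < 2.0:
        low, width = 1.0, 1.0
    elif 2.0 <= m < 4.0:
        low, width = 2.0, 2.0
    else:
        raise ValueError("m must lie in [1, 4)")
    index = int((m - low) / width * (1 << bits))
    midpoint = low + (index + 0.5) * width / (1 << bits)
    return 1.0 / math.sqrt(midpoint)


def taylor_sqrt(x: float) -> float:
    """Approximate sqrt(x) for finite x > 0 with a third-order series."""
    if not (x > 0.0 and math.isfinite(x)):
        raise ValueError("x must be positive and finite")
    frac, exp = math.frexp(x)      # x = frac * 2**exp with 0.5 <= frac < 1
    m, exp = frac * 2.0, exp - 1   # renormalize so that 1 <= m < 2
    if exp % 2:                    # odd exponent: fold one factor of 2 into m
        m, exp = m * 2.0, exp - 1
    y0 = seed_rsqrt(m)
    e = 1.0 - m * y0 * y0
    e2 = e * e                     # would come from the squaring unit
    e3 = e2 * e                    # would come from the cubing unit
    # Coefficients as sums of right-shifted terms:
    # 1/2 = >>1, 3/8 = (>>2) + (>>3), 5/16 = (>>2) + (>>4).
    series = (1.0 + e * 0.5
              + (e2 * 0.25 + e2 * 0.125)
              + (e3 * 0.25 + e3 * 0.0625))
    # sqrt(m) = m * (1/sqrt(m)); the exponent is even here, so halve it.
    return math.ldexp(m * y0 * series, exp // 2)


if __name__ == "__main__":
    for x in (0.3, 2.0, 10.0, 12345.678):
        approx = taylor_sqrt(x)
        print(x, approx, abs(approx - math.sqrt(x)) / math.sqrt(x))
```

Compared with the division series, whose coefficients are all 1, only the coefficients change, which is consistent with the observation that a few multiplexers selecting shifted terms suffice to add square root to the datapath.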
The proposed fused arithmetic unit has a 5-stage pipelined architecture. An fp-multiply operation has a latency of 5 cycles and can be fully pipelined. The pipeline diagram of one division operation and an example of mixed fp-mul, fp-div, and fp-sqrt operations are shown in Fig. 5 and Fig. 6, respectively.
IV. COMPARISON RESULTS
A comparison of other commonly used multiplicative algorithms with the proposed arithmetic unit for single-precision floating-point division is summarized in Table 3, where an 8-bit initial seed value and a 2-stage pipelined multiplier are assumed for a realistic implementation. Division algorithms are commonly used for square root implementations. However, the latency of a square root operation is usually longer than that of a division operation.
A brief survey of several floating-point dividers in leading microprocessors is summarized in Table 4 [7] [8]. As shown in the table, the division algorithm used in each of these architectures is one of the basic algorithms, which requires many cycles to compute the result. These characteristics may be acceptable in some cases, such as typical desktop applications. However, high performance is crucial for many applications, such as scientific computation, CAD tools, and 3D graphics rendering, which have a higher frequency of floating-point division and square root operations.
It is very difficult to compare designs that were built with different design methodologies, such as full-custom and standard-cell approaches. Generally, full custom requires 10x the design cost for 2x the speed and 1/2 the area as compared with the standard-cell methodology. As described earlier, because the proposed unit shares a single multiplier through a non-linear pipeline scheme, its latency and throughput for division and square root operations are worse than those of the direct implementation. However, considering the logic style and the fabrication technology, the performance of the proposed divider is better than that of the floating-point dividers in leading architectures, especially in terms of throughput. In terms of area, implementations based on the SRT algorithm may be smaller than multiplication-based implementations when a small radix is used. However, these algorithms require more cycles to compute the quotient with the required precision.
As shown in Table 3, the area of the binomial expansion (Goldschmidt's) algorithm is almost the same as that of the direct implementation of Liddicoat's algorithm. Therefore, the proposed arithmetic unit is also more area-efficient than an implementation of Goldschmidt's algorithm.
V. CONCLUSION
This paper presents a design plan for a fused floating-point multiply/divide/square root unit based on the Taylor-series expansion algorithm. The square root function can be easily incorporated into an existing fused multiply/divide unit with minimal area and latency overhead because of the similarity of the Taylor-series approximations. The resulting arithmetic unit is area-efficient compared with implementations of other commonly used division/square root algorithms, and it offers high throughput and moderate latency compared with the FPU implementations of leading architectures. We are currently implementing the fused floating-point multiply/divide/square root unit, and implementation details will follow in a future paper. We project that adding square root to the existing unit will increase the area by only 4.9%.
