Abstract-This paper presents a dynamically configurable and area-efficient multi-precision architecture for floating point (FP) division. FP division is a core arithmetic operation in scientific and engineering domains. We propose an architecture for double precision (DP) division that can also process dual (two-parallel) single precision (SP) computations, named the DPdSP FP divider. The architecture is based on the series expansion method of computing division. Key components of the floating point division architecture are redesigned to enable resource sharing and to tune the data path for processing operands of both precisions with minimum hardware overhead. The proposed architecture targets an ASIC implementation using the "OSUcells Cell Library" 0.18μm technology. Compared to a standalone double precision divider, the proposed dual mode unified architecture needs ≈ 7% extra hardware, with ≈ 5% delay overhead. Compared to previous work in the literature, the proposed dual mode architecture performs better in terms of required area, throughput, and area × delay; has smaller area & delay overhead over a DP only divider; and has more computational support.
I. INTRODUCTION
Floating point division is a core arithmetic operation needed in a multitude of scientific and engineering computations. The hardware complexity of floating point division is higher than that of the other basic arithmetic operations (addition, subtraction and multiplication); it requires a larger area while achieving relatively lower performance. In view of the large area requirement of division per unit computation, we aim for a unified & dynamically configurable, multi-precision architecture for this computation.
Researchers have proposed several multi-precision floating point arithmetic architectures, which have mainly focused on multipliers ( [1] , [2] , [3] ) and adders ( [4] , [5] , [6] ). The only multi-precision divider, for quadruple and dual double precision operands and based on the radix-4 SRT (digit recurrence) division method, has been proposed by Isseven et al. [7] . It is an iterative architecture with a throughput of 29 clock cycles; it supports only normal operands, and sub-normal operands are treated as zero. Other division methods are based on multiplicative (Newton-Raphson, Goldschmidt) and approximation techniques [8] . All these methods involve significant trade-offs between required area and delay, and their suitability depends on the operand size, required precision, and implementation platform (software, or hardware (FPGA/ASIC)).
This paper proposes an architecture for division which can be dynamically configured for either one double precision operand pair or two-parallel (dual) single precision operand pairs, called the DPdSP division architecture. Our design is based on the series expansion method (an approximation technique) of division ( [9] , [10] , [11] ). The proposed DPdSP architecture supports normal as well as sub-normal operands, with the round-to-nearest rounding method. A design with only normal support has also been implemented for comparison with the earlier method available in the literature. Further, a DP only design, with the same state-of-the-art data path flow, has been implemented for the area & delay overhead measurements. All the implemented designs take care of corner cases such as infinity, divide-by-zero and zero.
The main contributions of this work can be summarized as follows:
• We propose an architecture for DPdSP division with both normal & sub-normal support and with handling of all exceptional cases. Major components (such as leading-one detection, dynamic left & right shifting, mantissa computation, rounding, etc.) have been optimized/configured with a tuned data path to minimize the resource overhead.
• To the best of our knowledge, this is the only known multiplicative-based, dynamically configurable floating point division architecture with on-the-fly multi-precision support. The architecture supports the processing of the required corner cases.
• Compared to previous work in the literature, the proposed design has smaller area & delay overhead over a DP only divider; has better area, throughput, and area × delay metrics; and has more computational support.
II. BACKGROUND
Floating point arithmetic implementation involves computing separately the sign, exponent and mantissa part of the operands, and further combining them after rounding and normalization [12] , [13] . A basic state-of-the-art flow of the floating point division (including sub-normal processing) is given below in Algorithm 1.
In the present work, we have followed all the steps described in Algorithm 1 for the implementation of the proposed DPdSP division architecture. Each stage of the architecture has been constructed for the support of the dual precision arithmetic. 
III. PROPOSED DPDSP DIVISION ARCHITECTURE
The proposed floating point division architecture for double precision with dual (two-parallel) single precision support (DPdSP) is shown in Fig. 1 . The two 64-bit input operands may each contain either one double precision operand or two single precision operands. All the computational steps in dual mode are discussed below in detail.
A. Data Extraction, Sub-normal and Exceptional Handler
In this part, the sign, exponent and mantissa of the single or double precision operands are extracted from the input operands, according to the IEEE 754 floating point formats of single and double precision as follows.
Single Precision: 1-bit sign, 8-bit exponent, 23-bit mantissa.
Double Precision: 1-bit sign, 11-bit exponent, 52-bit mantissa.
The operands are checked for divide-by-zero; the DP divide-by-zero signal is obtained by ANDing both SP divide-by-zero signals with the SP-1 sign-bit. A similar check is done for zero. Then the exponents are checked for sub-normal conditions and updated along with the relevant mantissas. Since the 8-bit exponent of the second SP operand overlaps with the DP exponent, their sub-normal checks are shared to save resources. Similarly, the checks for infinity and NaN are shared between DP and SP. In this part, compared to a DP only design, extra resources are needed only for the checks on the first single precision operands.
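The field extraction and shared checks described above can be sketched in software as follows. This is a minimal illustration, not the paper's RTL: the function names are hypothetical, the packing of SP-2 in bits [63:32] and SP-1 in bits [31:0] is an assumption based on the remark that the SP-2 exponent overlaps the DP exponent, and the SP-1 sign bit is assumed to enter the shared zero check in inverted form (it must be 0 for the DP magnitude to be zero).

```python
def extract_fields(word64: int, dp_mode: bool):
    """Split a 64-bit operand word into (sign, exponent, fraction) tuples."""
    word64 &= (1 << 64) - 1
    if dp_mode:
        # IEEE 754 binary64: 1 sign bit, 11 exponent bits, 52 fraction bits.
        return [((word64 >> 63) & 1, (word64 >> 52) & 0x7FF, word64 & ((1 << 52) - 1))]
    fields = []
    # Assumed packing: SP-2 in bits [63:32], SP-1 in bits [31:0].
    for shift in (32, 0):
        w = (word64 >> shift) & 0xFFFFFFFF
        # IEEE 754 binary32: 1 sign bit, 8 exponent bits, 23 fraction bits.
        fields.append(((w >> 31) & 1, (w >> 23) & 0xFF, w & ((1 << 23) - 1)))
    return fields  # [SP-2, SP-1]


def classify(exponent: int, fraction: int, exp_bits: int) -> str:
    """Classify an operand from its raw exponent and fraction fields."""
    all_ones = (1 << exp_bits) - 1
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "nan"
    return "normal"


def dp_zero_from_sp_checks(word64: int) -> bool:
    """Derive the DP zero flag from the two SP zero flags (shared logic)."""
    sp2, sp1 = extract_fields(word64, dp_mode=False)
    sp2_zero = sp2[1] == 0 and sp2[2] == 0   # covers bits [62:32]
    sp1_zero = sp1[1] == 0 and sp1[2] == 0   # covers bits [30:0]
    # Bit 31 (the SP-1 sign position) is a DP fraction bit, so it must be 0 too.
    return sp2_zero and sp1_zero and sp1[0] == 0
```

Applied to a divisor word, `dp_zero_from_sp_checks` corresponds to the DP divide-by-zero flag; the same sharing idea extends to the sub-normal, infinity and NaN checks.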
B. Sub-normal Processing
This section of the architecture processes either the mantissas of the DP operands or the mantissas of the two SP operands. Sub-normal processing of an input mantissa first uses a leading-one-detector (LOD), which detects the position of the leading one and gives it as the left-shift amount for the relevant mantissa. A dynamic left shift of the mantissa then brings it into normalized format.
Algorithm 1 (final steps): Rounding -> f(guard-bit, round-bit, sticky-bit); Compute -> dp-ULP, sp2-ULP, sp1-ULP; ULP = dp_sp ? {dp-ULP} : {sp2-ULP, sp1-ULP}; div_M_rounded = div_M_S + ULP; Exp & Mantissa update for underflow, overflow and exceptional cases. (dp: double precision; sp: single precision.)

The architecture of the dual mode leading-one-detector is shown in Fig. 2 . The basic building block, LOD2:1, consists of three gates and is used in a hierarchical manner to build LOD64:6. The outputs of the two sub-units of LOD64:6 (the two LOD32:5) are taken as the left shift amounts for the two SP mantissas, whereas their combined result is the left shift amount for the DP mantissa. The resource requirement of the dual mode LOD is the same as that of the single mode (DP only) LOD.
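The hierarchical construction can be illustrated with a short behavioral sketch. It is not a gate-level model of Fig. 2: the function names are hypothetical, and it assumes that the SP-2 and SP-1 mantissas sit in the upper and lower 32-bit halves of the word. Each level combines two half-width detectors, so the two 32-bit results are available for SP at no extra cost.

```python
def lod(bits):
    """Return (valid, leading_zero_count) for a list of bits, MSB first.

    The list length must be a power of two; the base case plays the role
    of the LOD2:1 building block.
    """
    if len(bits) == 2:
        return bool(bits[0] or bits[1]), 0 if bits[0] else 1
    half = len(bits) // 2
    hi_valid, hi_count = lod(bits[:half])
    lo_valid, lo_count = lod(bits[half:])
    if hi_valid:
        return True, hi_count
    # No one in the upper half: skip it and use the lower half's position.
    return lo_valid, half + lo_count


def dual_mode_lod(word64):
    """Left-shift amounts for DP and for both SP lanes from two LOD32 units."""
    bits = [(word64 >> i) & 1 for i in range(63, -1, -1)]
    hi_valid, hi_count = lod(bits[:32])   # LOD32:5 on the upper half (SP-2)
    lo_valid, lo_count = lod(bits[32:])   # LOD32:5 on the lower half (SP-1)
    dp_count = hi_count if hi_valid else 32 + lo_count
    return {"dp": dp_count, "sp2": hi_count, "sp1": lo_count,
            "dp_valid": hi_valid or lo_valid}
```

The counts are only meaningful when the corresponding valid flag is set; an all-zero lane would be handled by the zero check rather than by the shifter.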
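The dual mode dynamic left shifter is organized in stages: an initial stage shifts by 32 under control of dp[5] (DP only), and stage-1 to stage-5 are each controlled by the corresponding shift-amount bits of DP, SP-2 and SP-1, from Stage-1 = f(dp[4], sp2[4], sp1[4]) down to the lowest-order bits.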
C. Sign, Exponent and Right Shift Computation
The sign and exponent computations are straightforward. If the computed exponent underflows (becomes negative), the right shift amount needs to be computed for right shifting the quotient mantissa. All the relevant computations of this section are done separately for the DP and both SP operands, as shown in Fig. 1 .
D. Mantissa Computation
The mantissa computation is the most critical part of the division architecture. It has been implemented using the series expansion method as follows.

Let x represent the dividend mantissa, y the divisor mantissa, and q the mantissa quotient, which can be computed as follows,

$$q = \frac{x}{y} = \frac{x}{a_1 + a_2} = x \cdot a_1^{-1}\left(1 - a_1^{-1} a_2 + a_1^{-2} a_2^{2} - a_1^{-3} a_2^{3} + \cdots\right) \tag{1}$$

where the divisor mantissa has been partitioned into $a_1$ and $a_2$, which can be written as follows,

$$y = a_1 + a_2, \qquad a_1 = 1.y_1 y_2 \ldots y_{m1}, \qquad a_2 = 2^{-m1} \times 0.y_{m1+1} y_{m1+2} \ldots \tag{2}$$

Here, the precomputed value of $a_1^{-1}$ is used to perform the remaining computation. Based on the bit width (m1) of $a_1$, the size of the memory (to store $a_1^{-1}$) and the number of terms from the series expansion can be decided. For a balanced case, m1 = 8 bits is a preferred choice for double precision computation. With m1 = 8 bits, it requires seven terms (up to $x \cdot a_1^{-7} a_2^{6}$) for double precision and three terms (up to $x \cdot a_1^{-3} a_2^{2}$) for single precision computation. The quotient expression can be written as follows.

For double precision it will be,

$$q = x \cdot a_1^{-1} - x \cdot a_1^{-1}\left(a_1^{-1} a_2 - a_1^{-2} a_2^{2}\right)\left(1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4}\right) \tag{3}$$

For single precision it will be,

$$q = x \cdot a_1^{-1} - x \cdot a_1^{-1}\left(a_1^{-1} a_2 - a_1^{-2} a_2^{2}\right) \tag{4}$$

In both eqs. (3) and (4), simplification has been done to achieve the maximum overlap of terms. Here, eq.(4) can be fully superimposed on eq.(3), and both can share the same hardware. The $a_1^{-4} a_2^{4}$, α and β blocks take part in DP computation only.

Figure labels (input packing): $a_1^{-1}$ ← dp_sp ? {1'b0, dp_sp2_i[52:0]} : {3'b0, dp_sp2_i[52:29], 3'b0, sp1_i}; x → dp_sp ? {1'b0, dp_m1} : {3'b0, sp2_m1, 3'b0, sp1_m1}; y → m2 → ($a_1$ + $a_2$).

The computational flow is as follows.
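The following sketch checks the series truncation numerically with exact rational arithmetic. It assumes that the DP and SP quotients take the factored forms q = p − p·Z·α and q = p − p·Z, with p = x·a1^-1, r = a1^-1·a2, Z = r − r^2 and α = 1 + r^2 + r^4; this factoring is inferred from the step-by-step flow and is not a transcription of the paper's equations. It also assumes that a1 keeps the hidden one plus the top 8 fraction bits, and all function names are hypothetical.

```python
from fractions import Fraction

M1 = 8  # assumed: a1 keeps the hidden one plus the top 8 fraction bits


def split_divisor(y):
    """Split a mantissa y in [1, 2) into a1 (leading bits) and a2 (the rest)."""
    a1 = Fraction(int(y * 2**M1), 2**M1)
    return a1, y - a1


def quotient(x, y, dp_mode):
    """Approximate x / y using the factored series (7 terms DP, 3 terms SP)."""
    a1, a2 = split_divisor(y)
    inv_a1 = 1 / a1          # exact here; the hardware reads it from a LUT
    r = inv_a1 * a2
    z = r - r * r            # Z, shared by both modes
    p = x * inv_a1
    if dp_mode:
        alpha = 1 + r**2 + r**4
        return p - p * z * alpha
    return p - p * z


def truncated_series(x, y, terms):
    """Reference: x/a1 times the first `terms` terms of the alternating series."""
    a1, a2 = split_divisor(y)
    r = a2 / a1
    return (x / a1) * sum((-r) ** k for k in range(terms))
```

Because a2 < 2^-8 and a1 ≥ 1, the first neglected term is below 2·2^-56 for the seven-term form and below 2·2^-24 for the three-term form, which is in line with the term counts quoted above for m1 = 8. A real data path would add LUT quantization and truncation errors on top of this.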
Initially, in step-1, the precomputed value of $a_1^{-1}$ is obtained from look-up tables (LUTs). The architecture uses two LUTs: one serves either the DP or the SP-2 operand (of size 256x53), and the other serves the SP-1 operand (of size 256x24). The first LUT is shared between DP and SP-2, depending on the computation being performed. The $a_1^{-1}$ value is selected using a multiplexer (mux), either for DP or for both SP's.
In step-2, $x \cdot a_1^{-1}$ is computed using a dual mode multiplier, 54x54_Dual_27x27 Mult (Fig. 5 ), which can perform either one 54x54 multiplication (for DP) or two 27x27 multiplications (for both SP's). Here, the input operand size is 54 bits, containing either the DP data or the data of both SP's. The multiplication is performed using the Karatsuba method [14] . With the Karatsuba method, the multiplication logic is reduced by 25%, at the additional cost of some adders and a subtractor. This helps in reducing the area. The computation of $x \cdot a_1^{-1}$ is performed as in eq.(5). A 54x54 multiplication requires two 27x27 multipliers, one 28x28 multiplier, two 27-bit adders, one 56-bit subtractor and one 91-bit adder. This dual mode multiplication needs no extra hardware over the DP only case, except for an additional mux.
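A behavioral sketch of this dual mode Karatsuba multiplication is given below. It assumes a single-level split into 27-bit halves, with each SP mantissa occupying one 27-bit half of the 54-bit operand; the function names are hypothetical and the code is an integer model, not the paper's eq.(5) or its RTL.

```python
MASK27 = (1 << 27) - 1


def _block_products(a, b):
    """The two 27x27 products shared by both modes."""
    a1, a0 = a >> 27, a & MASK27
    b1, b0 = b >> 27, b & MASK27
    return a1, a0, b1, b0, a1 * b1, a0 * b0


def mult_54x54(a, b):
    """DP mode: one 54x54 product from three smaller multiplications."""
    a1, a0, b1, b0, hi, lo = _block_products(a, b)
    # One 28x28 multiplication replaces the two cross products a1*b0 and a0*b1.
    mid = (a1 + a0) * (b1 + b0) - hi - lo
    return (hi << 54) + (mid << 27) + lo


def mult_dual_27x27(a, b):
    """SP mode: the two 27x27 products are the SP-2 and SP-1 results."""
    _, _, _, _, hi, lo = _block_products(a, b)
    return hi, lo
```

Three multiplications instead of four is where the quoted 25% reduction comes from; in SP mode the middle multiplier is simply unused, so only the output selection differs between the modes.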
Similarly, $a_1^{-1} a_2$ is computed using a 54x44_Dual_27x22 Mult, which performs either one 54x44 multiplication (for DP) or two 27x22 multiplications (for both SP's), using the Karatsuba method, as in eq.(6). Here, $a_2$ is 44 bits wide and contains data as shown in Fig. 4 .
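Step-3 computes $a_1^{-2} a_2^{2}$ with a dual mode squarer, 54x54_Dual_27x27 Square. This is done on a block basis, with a block size of 27 bits, and needs three 27x27 multipliers and one 90-bit adder as follows, where the 54-bit input A is split into an upper block $A_1$ and a lower block $A_0$:

$$A^{2} = \left(A_1 2^{27} + A_0\right)^{2} = A_1^{2}\, 2^{54} + 2 A_1 A_0\, 2^{27} + A_0^{2} \tag{7}$$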
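As a small behavioral sketch of a block-based dual mode squarer, assuming 27-bit blocks with one SP operand per block (the function name is hypothetical and this is not the paper's RTL):

```python
MASK27 = (1 << 27) - 1


def dual_mode_square(a, dp_mode):
    """Square a 54-bit value (DP) or both of its 27-bit blocks (SP)."""
    a1, a0 = a >> 27, a & MASK27
    sq_hi = a1 * a1      # 27x27 multiplier, also the SP-2 square
    sq_lo = a0 * a0      # 27x27 multiplier, also the SP-1 square
    if not dp_mode:
        return sq_hi, sq_lo
    cross = a1 * a0      # third 27x27 multiplier, used in DP mode only
    # 2 * cross * 2^27 is a left shift by 28.
    return (sq_hi << 54) + (cross << 28) + sq_lo
```

Squaring needs a single cross product, so three 27x27 multipliers suffice without the pre-additions of the Karatsuba scheme.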
Step-4 needs a 34-bit squarer to compute $a_1^{-4} a_2^{4}$. This is needed only for the DP flow and is implemented on a block basis (with a block size of 17 bits), similar to eq.(7). Since the contribution of this term to the DP result falls after 34-bit precision, a smaller multiplier suffices. This step also computes Z = $a_1^{-1} a_2 - a_1^{-2} a_2^{2}$, using two 27-bit subtractors (for both SP's), combined to form a 54-bit subtractor (for DP).
Step-5 and step-6 perform computations related to DP only.
Step-5 computes the last sum term of eq.(3), α = (1 + $a_1^{-2} a_2^{2}$ + $a_1^{-4} a_2^{4}$), using a 54-bit adder, whereas step-6 multiplies this sum (α) with the DP output of Z (from step-4) to obtain β, using a 54x54 multiplier based on the Karatsuba method.
Step-7 computes the dual multiplication using a 54x54_Dual_27x27 multiplier (as in Fig. 5 ). It multiplies $x \cdot a_1^{-1}$ with either β of step-6 or the single precision outputs of Z (from step-4), to generate W ($x \cdot a_1^{-1} \cdot β$ for DP, or $x \cdot a_1^{-1} \cdot Z$ for both SP's). Finally, in step-8, W is subtracted from $x \cdot a_1^{-1}$ using two 28-bit subtractors (for both SP's), which collectively perform a 56-bit subtraction for the DP mantissa quotient.
Thus, the proposed dual mode mantissa division architecture performs either one double precision division or two single precision divisions, at the extra cost of a $2^{8}$ × 24 LUT for SP-1 and multiplexers in each dual mode step.
E. Dynamic Right Shifting
The architecture of the dual mode dynamic right shifter is shown in Fig. 6 . The inputs to this unit are the mantissa quotient and the computed right shift amount. The underlying concept of its architecture is similar to that of the dual mode dynamic left shifter. In comparison to the left shifter, an additional multiplexer in stage-1 to stage-5 selects either the lower right-shifted output alone or its combination with the primary input of the stage.
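The staged behavior can be modeled as below. This is a hedged sketch and not a reproduction of Fig. 6: it assumes a logarithmic shifter in which an initial 32-position stage is used for DP only, followed by five stages shared between the modes, with the two SP lanes held in the upper and lower 32-bit halves. The function and parameter names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def dual_mode_right_shift(word64, dp_mode, dp_amt=0, sp2_amt=0, sp1_amt=0):
    """Right-shift one 64-bit value (DP) or two independent 32-bit lanes (SP)."""
    hi, lo = (word64 >> 32) & MASK32, word64 & MASK32
    if dp_mode and (dp_amt >> 5) & 1:
        hi, lo = 0, hi                       # shift by 32, DP only
    for bit in range(4, -1, -1):             # stages shifting by 16, 8, 4, 2, 1
        step = 1 << bit
        if dp_mode:
            if (dp_amt >> bit) & 1:
                # The lower lane receives bits shifted out of the upper lane.
                lo = (((hi << 32) | lo) >> step) & MASK32
                hi >>= step
        else:
            # In SP mode each lane shifts by its own amount; nothing crosses over.
            if (sp2_amt >> bit) & 1:
                hi >>= step
            if (sp1_amt >> bit) & 1:
                lo >>= step
    return (hi << 32) | lo
```

The per-stage choice between "lower lane alone" and "lower lane combined with bits from the upper lane" corresponds to the extra multiplexer mentioned above.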
F. Normalization, Rounding and Final Processing
This section processes the previously computed exponents and the dual mantissa division result to obtain the rounded, normalized format. The output of the dual mode mantissa division consists of either the DP mantissa quotient or both SP mantissa quotients, one in each of its 32-bit parts. Based on the MSBs of the quotient result, the rounding position is determined. Further, based on the rounding position bit, Guard-bit, Round-bit, Sticky-bit and MSB-bit, the round ULP (unit at last place) is computed. This ULP computation is performed separately for DP and for both SP's, and requires a few gates for each. The round ULP is then added to the mantissa quotient using two 28-bit adders, which work individually for the SP computations and collectively produce the output for DP. The rounded mantissa sum is then normalized. The rounding adder is, in effect, similar to that required for DP only processing.
Furthermore, the exponents are updated accordingly. Then each exponent and mantissa is updated for one of the infinity, sub-normal or underflow cases, each of which requires a separate unit. The computed signs, exponents and mantissas of the double precision and both single precision results are finally multiplexed to produce the final 64-bit output, which contains either one DP output or two SP outputs. A brief summary of the extra resource overhead of the proposed DPdSP division architecture over a DP only divider is shown in Table I.
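A minimal sketch of the round ULP decision for one lane is shown below. It assumes round-to-nearest with ties to even, which is a common reading of "round-to-nearest" but is not confirmed by the text; the function name and the integer interface are hypothetical.

```python
def round_nearest_even(value, drop):
    """Round off the low `drop` bits (drop >= 2) of a non-negative integer."""
    kept = value >> drop
    lsb = kept & 1
    guard = (value >> (drop - 1)) & 1
    round_bit = (value >> (drop - 2)) & 1
    sticky = int(value & ((1 << (drop - 2)) - 1) != 0)
    # Round up when above the halfway point, or exactly halfway with an odd LSB.
    ulp = guard & (lsb | round_bit | sticky)
    return kept + ulp
```

In the dual mode data path, one such ULP would be formed per lane and the results concatenated for SP mode; a carry out of the addition means the mantissa needs renormalizing, with a matching exponent increment.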
IV. IMPLEMENTATION RESULTS
This section presents the implementation details of the proposed DPdSP divider architecture, along with those of a DP only divider implementation. The proposed DPdSP architecture has been synthesized with the "OSUcells Cell Library" [15] 0.18μm technology, using Synopsys Design Compiler. The proposed architecture is currently aimed at a single cycle design. The DPdSP divider architecture has been synthesized both with only normal support and with sub-normal support. A DP only division design, with the same state-of-the-art data path flow, has also been synthesized (both with only normal support and with sub-normal support) for computing the area & delay overhead. The implementation details are shown in Table II. (Table notes: Throughput (in FO4) = Throughput (in cycles) × Period (in FO4); the combined metric is Gate Count × Throughput (in FO4).) With only normal support, the proposed DPdSP division architecture requires ≈ 4.3% more hardware and ≈ 5% more delay than the DP only division design; with sub-normal support included, it needs ≈ 6.9% more resources and ≈ 3.6% more delay.
A comparison with the previously reported dual-mode (double precision with two-parallel single precision support) division design in the literature is shown in Table III . To the best of our knowledge, there is no prior work on this topic using multiplicative methods of division. As mentioned earlier, Isseven et al. have proposed an iterative dual-mode division architecture using the radix-4 SRT division algorithm, a digit-recurrence method. Their design has a throughput of 29 clock cycles for double precision and 15 clock cycles for single precision. Further, it supports only normal operands, without sub-normal support. They synthesized their design using TSMC 0.25μm technology. We have made a technology independent comparison, in which the area is compared in terms of gate count (based on a minimum size inverter) and the delay/throughput is compared in terms of FO4 (Fan Out of 4) delay. Comparatively, the dual-mode architecture of Isseven et al. needs a larger area than the proposed DPdSP architecture. The throughput for double precision processing is much better for the proposed design. For single precision, the throughput of Isseven et al. is 15 clock cycles, equivalent to 471 FO4, which is also larger than that of the proposed work. The area × delay of the proposed architecture is much better than that of the previous work. The proposed architecture can be easily pipelined to obtain an even better throughput. The area & delay overhead over a DP only divider is also smaller for the proposed work than for the work of Isseven et al. In addition to better design metrics, the proposed work also supports the processing of sub-normal operands.
V. CONCLUSIONS
This paper has presented an architecture for floating point division with on-the-fly dual precision support. It supports configurable double precision with dual (two-parallel) single precision (DPdSP) floating point division. It supports the processing of normal and sub-normal operands. The data path of the architecture has been tuned to perform the dual mode computation with minimal hardware overhead. The crucial mantissa division module has been tuned together with the other components (dual mode LOD, dual mode dynamic left/right shifters, rounding) for on-the-fly dual mode computation. It has ≈ 4.3% − 6.9% area and ≈ 5% delay overhead over the DP only module. The proposed division architecture is the only known multiplicative-based, dynamically configurable design with on-the-fly multi-precision support. Compared to previous work in the literature, this work performs better in terms of required area, throughput, and area × delay; has smaller area & delay overhead over a DP only divider; and has more computational support. Future work will focus on configurable architectures for other division methods (Newton-Raphson, Goldschmidt).
