Configurable Architecture for Double/Two-Parallel Single Precision Floating Point Division by Cheung, RCC et al.
Title Configurable Architecture for Double/Two-Parallel Single Precision Floating Point Division
Author(s) Jaiswal, MK; Cheung, RCC; Balakrishnan, M; Paul, K
Citation Proceedings of 2014 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Tampa, FL, USA, 9-11 July 2014, p. 332-337
Issued Date 2014
URL http://hdl.handle.net/10722/249248
Rights
IEEE Computer Society Annual Symposium on VLSI. Copyright
© IEEE Computer Society.; ©2014 IEEE. Personal use of this
material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale
or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works.; This work is licensed
under a Creative Commons Attribution-NonCommercial-
NoDerivatives 4.0 International License.
Configurable Architecture for Double / Two-Parallel
Single Precision Floating Point Division
Manish Kumar Jaiswal, Ray C.C. Cheung
Department of EE, City University of Hong Kong, Hong Kong
Email: manish.kj@my.cityu.edu.hk, r.cheung@cityu.edu.hk
M. Balakrishnan and Kolin Paul
Department of CSE, IIT Delhi, India
Email:{mbala, kolin}@cse.iitd.ernet.in
Abstract—This paper presents a dynamically configurable and
area-efficient multi-precision architecture for Floating Point (FP)
division. FP division is a core arithmetic operation in scientific
and engineering domains. We propose an architecture for double
precision (DP) division which is also capable of processing dual
(two-parallel) single precision (SP) computations, named the DPdSP
FP divider. The architecture is based on the series expansion
methodology of computing division. Key components of the floating
point division architecture are re-designed to enable efficient
resource sharing and to tune the data path for processing operands
of both precisions with minimum hardware overhead. We have targeted
the proposed architecture at an ASIC implementation using the
“OSUcells Cell Library” in 0.18µm technology. Compared to a
standalone double precision divider, the proposed dual mode unified
architecture needs ≈ 7% extra hardware, with ≈ 5% delay overhead.
Compared to previous work in the literature, the proposed dual mode
architecture performs better in terms of required area, throughput,
and area×delay; has smaller area & delay overhead over a DP-only
divider; and has more computational support.
Index Terms—Floating Point Division, ASIC, Multi-precision
Arithmetic, Dynamic Configurable Computing.
I. INTRODUCTION
Floating point division is a core arithmetic needed in a
multitude of scientific and engineering computations. The
hardware complexity of floating point division arithmetic
is more than the other basic arithmetic operations (adder,
subtractor and multiplier), and it requires larger area while
achieving relatively lower performance. In view of the large
area requirement of division arithmetic per unit computation,
we aim for a unified & dynamically configurable, multi-
precision architecture for this computation.
Researchers have proposed several multi-precision floating point
arithmetic architectures, which have mainly focused on multipliers
([1], [2], [3]) and adders ([4], [5], [6]). The only multi-precision
divider, for quadruple and dual double precision operands and based
on the radix-4 SRT (digit recurrence) division method, has been
proposed by Isseven et al. [7]. It is an iterative architecture with
a throughput of 29 clock cycles; it supports only normal operands,
and sub-normal operands are treated as zero. Other division
methodologies are based on multiplicative (Newton-Raphson,
Goldschmidt) and approximation techniques [8]. All these methods
entail significant trade-offs between required area and delay, and
their suitability depends on the operand size, required precision,
and implementation platform (software, or hardware (FPGA/ASIC)).
This paper proposes an architecture for division arithmetic which
can be dynamically configured for either a double precision operand
or two-parallel (dual) single precision operands, called the DPdSP
division architecture. Our design is based on the series expansion
method (an approximation technique) of division ([9], [10], [11]).
The proposed DPdSP architecture supports normal as well as
sub-normal operands, with the round-to-nearest rounding method. A
design with only normal support has also been implemented for
comparison with the earlier method available in the literature.
Further, a DP-only design, with the same state-of-the-art data path
flow, has also been implemented for area & delay overhead
measurements. All the implemented designs take care of corner cases
such as infinity, divide-by-zero, and zero.
The main contributions of this work can be summarized as
follows:
• We propose an architecture for DPdSP division with both
normal & sub-normal support and with handling of all
exceptional cases. Major components (leading-one detection,
dynamic left & right shifting, mantissa computation,
rounding, etc.) have been optimized/configured with a tuned
data path to minimize the resource overhead.
• To the best of our knowledge, this is the only known
multiplicative-based, dynamically configurable floating
point division architecture with on-the-fly multi-precision
support. The architecture supports the processing of the
required corner cases.
• Compared to previous work in the literature, the proposed
work has smaller area & delay overhead over a DP-only
divider; has better area, throughput, and area×delay
metrics; and has more computational support.
II. BACKGROUND
Floating point arithmetic implementation involves comput-
ing separately the sign, exponent and mantissa part of the
operands, and further combining them after rounding and
normalization [12], [13]. A basic state-of-the-art flow of the
floating point division (including sub-normal processing) is
given below in Algorithm 1.
In the present work, we have followed all the steps described
in Algorithm 1 for the implementation of the proposed DPdSP
division architecture. Each stage of the architecture has been
constructed for the support of the dual precision arithmetic.
Algorithm 1 F.P. Division Computational Flow [12], [13]
1: (IN1 (Dividend), IN2 (Divisor)) Input Operands;
2: Data Extraction & Exceptional Check-up:
{S1(Sign1), E1(Exponent1), M1(Mantissa1)} ← IN1
{S2, E2, M2} ← IN2
Check for Infinity, Sub-Normal, Zero, Divide-By-Zero
3: Process both Mantissa for Sub-Normal:
Leading One Detection of both Mantissa
Dynamic Left Shifting of both Mantissa
4: Sign, Exponent & Right-Shift Computation:
S← S1⊕S2
E ← (E1−L_Shift_E1) − (E2−L_Shift_E2) + BIAS
R_Shift ← (E2−L_Shift_E2) − (E1−L_Shift_E1) − BIAS
5: Mantissa Computation: M←M1/M2
6: Dynamic Right Shifting of Quotient Mantissa
7: Normalization & Rounding:
Determine Correct Rounding Position
Compute ULP using Guard, Round & Sticky Bit
Compute M←M+ULP
1-bit Right Shift Mantissa in Case of Mantissa Overflow
Update Exponent
8: Finalizing Output:
Determine STATUS signal & Check Exceptional Cases
Determine Final Output
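As an illustration of steps 2 to 4 of Algorithm 1, the following Python sketch computes the sign, exponent and right-shift amount for one single precision operand pair. It is a behavioural model only, not the hardware of this paper; the function names (`float_bits`, `unpack_sp`, `left_shift_amount`, `sign_exp_rshift`) are hypothetical, and the exponent of a sub-normal operand is assumed to be treated as 1, as is usual for IEEE-754 sub-normals.

```python
import struct

def float_bits(value):
    """Return the IEEE-754 single precision bit pattern of `value`."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def unpack_sp(word):
    """Step 2: split a 32-bit word into sign, exponent and 24-bit mantissa."""
    sign = (word >> 31) & 1
    exp = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    if exp == 0:
        # Sub-normal (assumption): no hidden one, exponent treated as 1.
        return sign, 1, frac
    return sign, exp, (1 << 23) | frac

def left_shift_amount(mant, width=24):
    """Step 3: number of leading zeros of the mantissa (the LOD result)."""
    return width - mant.bit_length()

def sign_exp_rshift(in1, in2, bias=127):
    """Step 4: sign, biased exponent and right-shift amount of in1 / in2."""
    s1, e1, m1 = unpack_sp(in1)
    s2, e2, m2 = unpack_sp(in2)
    ls1, ls2 = left_shift_amount(m1), left_shift_amount(m2)
    sign = s1 ^ s2
    exp = (e1 - ls1) - (e2 - ls2) + bias
    r_shift = (e2 - ls2) - (e1 - ls1) - bias
    return sign, exp, r_shift
```

The right-shift amount is simply the negated exponent, so it is positive only when the computed exponent underflows, which is the case in which the quotient mantissa has to be shifted right.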
III. PROPOSED DPDSP DIVISION ARCHITECTURE
The proposed floating point division architecture for double
precision with dual (two-parallel) single precision support
(DPdSP) is shown in Fig. 1. The two 64-bit input operands may
contain either one set of double precision operands or two sets
of single precision operands. All the computational steps in dual
mode are discussed below in detail.
A. Data Extraction, Sub-normal and Exceptional Handler
In this part, the sign, exponent and mantissa of single or
double precision operands have been extracted from the input
operands, according to the floating point formats of single and
double precision as follows.
Single Precision: | Sign-bit (1-bit) | exponent (8-bit) | mantissa (23-bit) |
Double Precision: | Sign-bit (1-bit) | exponent (11-bit) | mantissa (52-bit) |
The operands are checked for divide-by-zero, where the DP
divide-by-zero signal is obtained by combining both SP
divide-by-zero signals with the SP-1 sign bit (bit 31 of the
divisor), as shown in Fig. 1. A similar check is done for zero.
Then the exponents are checked for sub-normal conditions and
updated along with the relevant mantissas. Since the 8-bit exponent
of the second SP overlaps the upper bits of the DP exponent, their
sub-normal checks are shared to save resources. Similarly, the
checks for infinity and NaN are shared between DP and SP. In this
part, compared to a DP-only design, we need extra resources for the
checks on the first single precision operands.
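The field extraction and the shared sub-normal check can be modelled in software as below. This is a minimal sketch that mirrors the sub-normal, exponent and mantissa expressions listed in Fig. 1 for one 64-bit input word; the function names `extract_sp` and `extract_dual` and the dictionary layout are hypothetical and are not taken from the paper.

```python
def extract_sp(word32):
    """Extract one single precision operand from a 32-bit word."""
    sn = int(((word32 >> 23) & 0xFF) == 0)             # sn = ~|in[30:23]
    return {
        "s": (word32 >> 31) & 1,                       # sign bit
        "sn": sn,                                      # sub-normal flag
        "e": ((word32 >> 23) & 0xFF) | sn,             # {in[30:24], in[23] | sn}
        "m": ((sn ^ 1) << 23) | (word32 & 0x7FFFFF),   # {~sn, in[22:0]}
        "z": int((word32 & 0x7FFFFFFF) == 0),          # ~|in[30:0]
    }

def extract_dual(word64):
    """Return (dp, sp2, sp1) views of the same 64-bit input word."""
    sp1 = extract_sp(word64 & 0xFFFFFFFF)
    sp2 = extract_sp((word64 >> 32) & 0xFFFFFFFF)
    # The DP sub-normal check reuses the SP-2 check on bits [62:55]
    # and only adds the three remaining exponent bits [54:52].
    dp_sn = int(((word64 >> 52) & 0x7) == 0) & sp2["sn"]
    dp = {
        "s": (word64 >> 63) & 1,
        "sn": dp_sn,
        "e": ((word64 >> 52) & 0x7FF) | dp_sn,         # {in[62:53], in[52] | sn}
        "m": ((dp_sn ^ 1) << 52) | (word64 & ((1 << 52) - 1)),
    }
    return dp, sp2, sp1
```

For example, the word 0x400000003F800000 holds the single precision values 2.0 (upper half, SP-2) and 1.0 (lower half, SP-1), while 0x3FF0000000000000 is the double precision value 1.0.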
B. Sub-normal Processing
This section of the architecture processes either the mantissas
of the DP operands or the mantissas of the two SP operand pairs.
The sub-normal processing of an input mantissa first uses a
leading-one-detector (LOD), which detects the position of the
leading one and gives the left-shift amount for the relevant
mantissa. It then requires a dynamic left shift of the mantissa to
bring it into normalized format.
[Figure 1: block diagram. Its contents, in data-flow order, are listed below.]

Input / Output Register Format (64-bit): in1 (Dividend), in2 (Divisor)
  DP: sign (1-bit) | exponent (11-bit) | mantissa (52-bit)
  Dual SP: DP[63:32] / SP2 and DP[31:0] / SP1, each 32-bit word holding
           sign (1-bit) | exponent (8-bit) | mantissa (23-bit)
  dp_sp : 1 (DP processing), 0 (dual SP processing)

Data Extraction, SubNormal & Exceptional Handler:
  SP-1:
    sp1_s1 = in1[31],  sp1_s2 = in2[31]
    sp1_sn1 = ~|in1[30:23]
    sp1_sn2 = ~|in2[30:23]
    sp1_sn = sp1_sn1 & sp1_sn2
    sp1_e1 = {in1[30:24], in1[23] | sp1_sn1}
    sp1_e2 = {in2[30:24], in2[23] | sp1_sn2}
    sp1_m1 = {~sp1_sn1, in1[22:0]}
    sp1_m2 = {~sp1_sn2, in2[22:0]}
    sp1_z = ~|in1[30:0]
    sp1_dbz = ~|in2[30:0]
  SP-2:
    sp2_s1 = in1[63],  sp2_s2 = in2[63]
    sp2_sn1 = ~|in1[62:55]
    sp2_sn2 = ~|in2[62:55]
    sp2_sn = sp1_sn1 & sp2_sn2
    sp2_e1 = {in1[62:56], in1[55] | sp2_sn1}
    sp2_e2 = {in2[62:56], in2[55] | sp2_sn2}
    sp2_m1 = {~sp2_sn1, in1[54:32]}
    sp2_m2 = {~sp2_sn2, in2[54:32]}
    sp2_z = ~|in1[62:32]
    sp2_dbz = ~|in2[62:32]
  DP:
    dp_s1 = in1[63],  dp_s2 = in2[63]
    dp_sn1 = ~|in1[54:52] & sp2_sn1
    dp_sn2 = ~|in2[54:52] & sp2_sn2
    dp_sn = dp_sn1 & dp_sn2
    dp_e1 = {in1[62:53], in1[52] | dp_sn1}
    dp_e2 = {in2[62:53], in2[52] | dp_sn2}
    dp_m1 = {~dp_sn1, in1[51:0]}
    dp_m2 = {~dp_sn2, in2[51:0]}
    dp_z = ~(sp1_z & sp2_z & ~in1[31])
    dp_dbz = ~(sp1_dbz & sp2_dbz & ~in2[31])

Sub-normal processing (once for m1, once for m2, controlled by dp_sp):
  two LOD 32-bit units on dp_m1 / {sp2_m1, sp1_m1} and on dp_m2 / {sp2_m2, sp1_m2}
  give sp2_ls1, dp_ls1, sp1_ls1 and sp2_ls2, dp_ls2, sp1_ls2;
  Dual Dynamic Left Shift then produces M1 and M2.

Sign, Exp & Right-Shift:
  sp1_s = sp1_s1 ^ sp1_s2
  dp_s = sp2_s = sp2_s1 ^ sp2_s2
  sp1_e = 7'h7F + (sp1_e1-sp1_ls1) - (sp1_e2-sp1_ls2)
  sp2_e = 7'h7F + (sp2_e1-sp2_ls1) - (sp2_e2-sp2_ls2)
  dp_e = 10'h3FF + (dp_e1-dp_ls1) - (dp_e2-dp_ls2)
  sp1_rs = (sp1_e2-sp1_ls2) - 7'h7F - (sp1_e1-sp1_ls1)
  sp2_rs = (sp2_e2-sp2_ls2) - 7'h7F - (sp2_e1-sp2_ls1)
  dp_rs = (dp_e2-dp_ls2) - 10'h3FF - (dp_e1-dp_ls1)

Dual Mantissa Division Architecture (dp_sp): M1, M2 -> div_M
Dual Dynamic Right Shift (*_rs): div_M -> div_M_S
Normalization & Rounding:
  Rounding -> f(guard-bit, round-bit, sticky-bit)
  Compute -> dp-ULP, sp2-ULP, sp1-ULP
  ULP = dp_sp ? {dp-ULP} : {sp2-ULP, sp1-ULP}
  div_M_rounded = div_M_S + ULP
Exp & Mantissa Update for Underflow, Overflow, Exceptional
Final Output (64-bit)

Signal naming: dp: Double Precision; sp: Single Precision; _e: Exponent;
_m: Mantissa; _s: Sign; _z: zero; _sn: sub-normal; _dbz: Divide-By-Zero;
_ls: Left shift; _rs: Right Shift

Fig. 1: DPdSP Division Architecture
The architecture of the dual mode leading-one-detector is
shown in Fig. 2. The basic building block, LOD2:1, consists
of three gates and is used in a hierarchical manner to build
LOD64:6. The outputs of the sub-units of LOD64:6 (the two
LOD32:5) are taken as the left shift amounts for the two SPs,
whereas their combined result is the left shift amount for the
DP mantissa. The resource requirement of the dual mode LOD is
the same as that of the single mode (DP-only) LOD.
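The hierarchical construction can be sketched in software as a recursive function. The model below is a behavioural illustration under the assumption that each level selects the high-half result when the high half contains a one, and otherwise the low-half result offset by the half width; the names `lod` and `dual_lod` are hypothetical.

```python
def lod(bits, width):
    """Return (valid, shift) for a `width`-bit input (width a power of two).

    `shift` is the number of leading zeros when `valid` is 1; it is a
    don't-care value when the input is all zeros.
    """
    if width == 2:
        hi, lo = (bits >> 1) & 1, bits & 1
        return hi | lo, hi ^ 1          # LOD2:1 building block
    half = width // 2
    hv, hs = lod(bits >> half, half)                  # high half
    lv, ls = lod(bits & ((1 << half) - 1), half)      # low half
    # Mux: {1'b0, out_h} if the high half is valid, else {1'b1, out_l}.
    shift = hs if hv else half + ls
    return hv | lv, shift

def dual_lod(word64):
    """Return ((dp_v, dp_shift), (sp2_v, sp2_shift), (sp1_v, sp1_shift))."""
    sp2 = lod(word64 >> 32, 32)
    sp1 = lod(word64 & 0xFFFFFFFF, 32)
    # The DP result is only the top-level combination of the two halves,
    # which is why the dual mode adds no logic over a plain 64-bit LOD.
    dp = (sp2[0] | sp1[0], sp2[1] if sp2[0] else 32 + sp1[1])
    return dp, sp2, sp1
```

Both 32-bit results are intermediate values of the 64-bit detector, so the dual mode only needs to expose them as outputs.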
[Figure 2: block diagram. The detector is built hierarchically as LOD_2:1 -> LOD_4:2 -> LOD_8:3 -> LOD_16:4 -> LOD_32:5. Each level takes the results of a high half (out_h, out_hv) and a low half (out_l, out_lv), selects {1'b0,out_h} or {1'b1,out_l} through a 2:1 multiplexer, and produces a valid signal out_v. At the top level, in[63:32] and in[31:0] feed two LOD_32:5 units, which give sp2_shift[4:0] and sp1_shift[4:0]; their combination gives dp_shift[5:0].]
Fig. 2: Dual Mode Leading-One-Detector

The architecture of the dual mode dynamic left shifter is shown
in Fig. 3. The left shift amount from the previous LOD unit,
along with the relevant mantissa, is provided as input to this
unit. When dp_sp is true, the SP left shift amounts are set to
zero; similarly, when dp_sp is false, the DP left shift amount is
forced to zero. The initial stage processes only the MSB of the DP
left shift amount (with dp_sp true). All the other stages (Stage-1
to Stage-5) work in dual shift mode. Each of these stages contains
two multiplexers, one for each 32-bit block, which shift their
inputs based on the corresponding shift bit (of either double or
single precision). In addition, each stage contains a multiplexer
which selects between the lower shifting output and its combination
with the primary input to the stage, based on dp_sp and the
corresponding DP shift bit. Except for this multiplexer, the
architecture behaves like two 32-bit barrel shifters, constructed
to support the dual mode shifting operation. The proposed dual mode
dynamic left shifter has a minor area & delay overhead over a
single mode 64-bit dynamic left shifter.
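A behavioural model of this staged shifter is sketched below. It assumes that the extra multiplexer of each stage passes the top y bits of the lower 32-bit block into the upper block when dp_sp and the DP shift bit are both set; the function name `dual_left_shift` and its argument names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

def dual_left_shift(word, dp_sp, dp_shift=0, sp2_shift=0, sp1_shift=0):
    """Shift one 64-bit word (dp_sp true) or two 32-bit words (dp_sp false)."""
    if dp_sp:
        sp2_shift = sp1_shift = 0       # SP shift amounts forced to zero
    else:
        dp_shift = 0                    # DP shift amount forced to zero
    # Initial stage: only the MSB of the DP shift amount (shift by 32).
    if (dp_shift >> 5) & 1:
        word = (word << 32) & MASK64
    hi, lo = word >> 32, word & MASK32
    # Stage-1 to Stage-5 handle shift bits 4 down to 0.
    for x in (4, 3, 2, 1, 0):
        y = 1 << x
        dp_bit = (dp_shift >> x) & 1
        sel_hi = dp_bit | ((sp2_shift >> x) & 1)
        sel_lo = dp_bit | ((sp1_shift >> x) & 1)
        carry = lo >> (32 - y)          # bits [31:32-y] of the lower block
        new_hi = ((hi << y) & MASK32) if sel_hi else hi
        new_lo = ((lo << y) & MASK32) if sel_lo else lo
        if dp_bit:                      # extra mux, active in DP mode only
            new_hi |= carry
        hi, lo = new_hi, new_lo
    return (hi << 32) | lo
```

Without the `carry` path the loop is exactly two independent 32-bit barrel shifters, which matches the observation that only one additional multiplexer per stage is needed.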
C. Sign, Exponent and Right Shift Computation
The sign and exponent computations are straightforward.
In case of an underflowed (negatively computed) exponent, the
right shift amount needs to be computed for the right shifting of
the quotient mantissa. All the relevant computations of this section
are done separately for the DP and both SP operands, as
shown in Fig. 1.
D. Mantissa Computation
The mantissa computation is the most critical part in di-
vision architecture. This has been implemented using series
expansion method as follows.
[Figure 3: block diagram. The input in[63:0] holds either one 64-bit word or two 32-bit words {[31:0],[31:0]}; the shift amounts are dp[5:0], sp2[4:0] and sp1[4:0]. An initial stage selects between in and in << 32 using dp[5]. Stage-1 to Stage-5 are functions of (dp[4], sp2[4], sp1[4]) down to (dp[0], sp2[0], sp1[0]). One stage unit for shift bit x, with y = 2**x: the [63:32] block is shifted << y under dp[x] | sp2[x], the [31:0] block is shifted << y under dp[x] | sp1[x], and a multiplexer controlled by dp_sp & dp[x] combines bits [31:32-y] of the lower block with bits [63:32+y]. The result is the Shifted Output.]
Fig. 3: Dual Mode Dynamic Left Shifter
Let x represent the dividend mantissa, y represent divisor
mantissa and q be the mantissa quotient, which can be com-
puted as follows,
$$q = \frac{x}{y} = \frac{x}{a_1 + a_2} = x \times (a_1 + a_2)^{-1} \qquad (1)$$
where, the divisor mantissa has been partitioned to a1 and a2,
which can be written as follows,
$$(a_1 + a_2)^{-1} = a_1^{-1} - a_1^{-2} a_2 + a_1^{-3} a_2^{2} - a_1^{-4} a_2^{3} + \cdots \qquad (2)$$
Here, the precomputed value of $a_1^{-1}$ is used to perform the
remaining computation. Based on the bit width ($m_1$) of $a_1$, the
size of the memory (to store $a_1^{-1}$) and the number of terms of
the series expansion can be decided. For a balanced case,
$m_1 = 8$ bits is a preferred choice for double precision
computation. With $m_1 = 8$ bits, seven terms (up to
$a_1^{-7} a_2^{6}$) are required for double precision and three
terms (up to $a_1^{-3} a_2^{2}$) for single precision computation.
The quotient expressions can be written as follows.
For double precision it will be,
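$$
\begin{aligned}
q &= x \times \left[ a_1^{-1} - a_1^{-2} a_2 + a_1^{-3} a_2^{2} - a_1^{-4} a_2^{3} + a_1^{-5} a_2^{4} - a_1^{-6} a_2^{5} + a_1^{-7} a_2^{6} \right] \\
  &= x a_1^{-1} - x a_1^{-1} \left\{ \left( a_1^{-1} a_2 - a_1^{-2} a_2^{2} \right) \times \left( 1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4} \right) \right\}
\end{aligned} \qquad (3)
$$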
For single precision it will be,
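$$
\begin{aligned}
q &= x \times \left[ a_1^{-1} - a_1^{-2} a_2 + a_1^{-3} a_2^{2} \right] \\
  &= x a_1^{-1} - x a_1^{-1} \left( a_1^{-1} a_2 - a_1^{-2} a_2^{2} \right)
\end{aligned} \qquad (4)
$$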
[Figure 4: block diagram. Its contents are listed below.]
  Inputs: dp_m1[52:0], dp_m2[52:0], sp1_m1[23:0], sp1_m2[23:0], sp2_m1[23:0], sp2_m2[23:0], dp_sp
  x <- dp_sp ? {1'b0, dp_m1} : {3'b0, sp2_m1, 3'b0, sp1_m1}
  y -> m2 -> (a1 + a2)
  a2 <- dp_sp ? dp_m2[43:0] : {7'b0, sp2_m2[14:0], 7'b0, sp1_m2[14:0]}
  LUT addresses: dp_m2[51:44] or sp2_m2[22:15] (selected by dp_sp) for the 256x53 LUT (DP-SP2-RAM, output dp_sp2_i); sp1_m2[22:15] for the 256x24 LUT (SP1-RAM, output sp1_i)
  $a_1^{-1}$ <- dp_sp ? {1'b0, dp_sp2_i[52:0]} : {3'b0, dp_sp2_i[52:29], 3'b0, sp1_i}
  Step 1: 256x53 LUT and 256x24 LUT -> $a_1^{-1}$
  Step 2: 54x54_Dual_27x27 Mult -> $x a_1^{-1}$; 54x44_Dual_27x22 Mult -> $a_1^{-1} a_2$
  Step 3: 54x54_Dual_27x27 Square -> $a_1^{-2} a_2^{2}$
  Step 4: 54-bit, Dual 27-bit Subtractor -> $Z = a_1^{-1} a_2 - a_1^{-2} a_2^{2}$; 34-bit Square -> $a_1^{-4} a_2^{4}$
  Step 5: 54-bit Adder -> $\alpha = 1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4}$
  Step 6: 54x54 Mult -> $\beta = \alpha Z$
  Step 7: 54x54_Dual_27x27 Mult -> $W = x a_1^{-1} (\mathrm{dp\_sp} \;?\; \beta : Z)$
  Step 8: 56-bit, Dual 28-bit Subtractor -> $x a_1^{-1} - W$, the 56-bit Mantissa Division Result (dp_o, sp_o)
  The dual mode blocks are marked as Dynamically Configurable Blocks.
Fig. 4: DPdSP Dual Mode Mantissa Division Architecture

In both eq. (3) and eq. (4), simplification has been done to
achieve the maximum overlap of terms. Here, eq. (4) can be fully
superimposed on eq. (3), and both can share the same computational
flow. This overlap lets us efficiently model the architecture for
the mantissa quotient computation, which can compute either one DP
mantissa or two SP mantissas.
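The factored forms of eq. (3) and eq. (4) can be checked numerically. The sketch below uses exact rational arithmetic, so it shows only the truncation error of the series and not the effect of the finite-width look-up table and multipliers of the actual data path. The helper names (`split_divisor`, `quotient_sp`, `quotient_dp`) are hypothetical, and $a_1$ is assumed to hold the hidden one plus the 8 leading fraction bits of the divisor mantissa.

```python
from fractions import Fraction

def split_divisor(y_int, frac_bits, m1=8):
    """Split an integer mantissa (hidden bit included) into a1 and a2."""
    low = frac_bits - m1
    a1 = Fraction((y_int >> low) << low, 1 << frac_bits)
    a2 = Fraction(y_int & ((1 << low) - 1), 1 << frac_bits)
    return a1, a2

def quotient_sp(x, a1, a2):
    """Factored form of eq. (4): three series terms."""
    r = 1 / a1                               # a1^-1, a LUT value in hardware
    z = r * a2 - r**2 * a2**2                # Z
    return x * r - x * r * z

def quotient_dp(x, a1, a2):
    """Factored form of eq. (3): seven series terms."""
    r = 1 / a1
    z = r * a2 - r**2 * a2**2                # Z
    alpha = 1 + r**2 * a2**2 + r**4 * a2**4  # alpha
    return x * r - x * r * (alpha * z)       # beta = alpha * Z
```

With $t = a_2 / a_1 < 2^{-8}$, the relative truncation error is $t^3 < 2^{-24}$ for the three-term form and $t^7 < 2^{-56}$ for the seven-term form, which is consistent with the choice of three and seven terms for the 24-bit and 53-bit mantissas.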
The DPdSP mantissa division architecture is shown in
Fig. 4. This dual mode mantissa division architecture computes
either eq. (3) or eq. (4) on the same data path, with efficient
resource sharing and minimal overhead. In this architecture,
except for the computation of $a_1^{-4} a_2^{4}$, $\alpha$ and
$\beta$ (in steps 4, 5 & 6), all components are shared and
dynamically configurable for DP as well as both SP computations.
The $a_1^{-4} a_2^{4}$, $\alpha$ and $\beta$ blocks take part in
the DP computation only. The computational flow is as follows.
Initially, in step-1, the precomputed value of $a_1^{-1}$ is
obtained from look-up tables (LUTs). The architecture uses two
LUTs: one serving either the DP or the SP-2 operand (of size
256x53), and the other serving the SP-1 operand (of size 256x24).
The first LUT is shared between DP and SP-2, depending on the
current computation. The $a_1^{-1}$ value is selected using a
multiplexer (mux), either for DP or for both SPs.
[Figure 5: block diagram. The inputs W[53:0] and X[53:0] are split into 27-bit halves. m00 = W[26:0] × X[26:0] (SP-1 Product), m11 = W[53:27] × X[53:27] (SP-2 Product), m10 = (W[53:27] + W[26:0]) × (X[53:27] + X[26:0]). DP Product = {{m11,m00[53:27]} + m10 - (m11+m00), m00[26:0]}.]
Fig. 5: 54x54_Dual_27x27 Multiplier

In step-2, $x a_1^{-1}$ and $a_1^{-1} a_2$ are computed. For
$x a_1^{-1}$, we use a dual mode multiplier (54x54_Dual_27x27
Multiplier, Fig. 5), which can perform either one 54x54
multiplication (for DP) or two 27x27 multiplications (for both
SPs). Here, the input operands are 54 bits wide and contain either
the DP data or the data of both SPs. The multiplication is
performed using the Karatsuba method [14], which reduces the
multiplier logic by 25% (three half-width multipliers instead of
four) at the additional cost of some adders and a subtractor. This
helps in reducing the area. The computation of $x a_1^{-1}$ is
performed as in eq. (5). A 54x54 multiplication requires two 27x27
multipliers, one 28x28 multiplier, two 27-bit adders, one 56-bit
subtractor and one 91-bit adder. This dual mode multiplication does
not need any extra hardware over the DP-only case, except for an
additional mux.
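$$
\begin{aligned}
\mathrm{sp1\_o}[53:0] &= x[26:0] * a_1^{-1}[26:0] \\
\mathrm{sp2\_o}[53:0] &= x[53:27] * a_1^{-1}[53:27] \\
\mathrm{tmp}[55:0] &= \left( (a_1^{-1}[53:27] + a_1^{-1}[26:0]) * (x[53:27] + x[26:0]) \right) - (\mathrm{sp2\_o} + \mathrm{sp1\_o}) \\
\mathrm{dp\_o}[107:0] &= \{\mathrm{sp2\_o}, \mathrm{sp1\_o}\} + \{\mathrm{tmp}, \text{27'h0}\} \\
x a_1^{-1} &= \mathrm{dp\_sp} \;?\; \mathrm{dp\_o} : \{\mathrm{sp2\_o}, \mathrm{sp1\_o}\}
\end{aligned} \qquad (5)
$$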
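Eq. (5) can be checked with a short integer model. The sketch below follows the equation term by term and uses Python's arbitrary-precision integers in place of the fixed-width hardware buses; the function name `dual_mult_54` and the use of `w` for the second operand ($a_1^{-1}$ in step-2) are hypothetical choices.

```python
M27 = (1 << 27) - 1

def dual_mult_54(x, w, dp_sp):
    """Model of eq. (5): one 54x54 product or two 27x27 products."""
    x_hi, x_lo = x >> 27, x & M27
    w_hi, w_lo = w >> 27, w & M27
    sp1_o = x_lo * w_lo                       # SP-1 product (54-bit)
    sp2_o = x_hi * w_hi                       # SP-2 product (54-bit)
    # Karatsuba cross term: one extra multiplication instead of two.
    tmp = (w_hi + w_lo) * (x_hi + x_lo) - (sp2_o + sp1_o)
    packed = (sp2_o << 54) | sp1_o            # {sp2_o, sp1_o}
    dp_o = packed + (tmp << 27)               # {tmp, 27'h0} added in DP mode
    return dp_o if dp_sp else packed
```

The two SP products are computed in both modes; the DP mode only adds the cross term on top of them, which is why the dual mode costs little more than the final multiplexer.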
Similarly, $a_1^{-1} a_2$ is computed using a
54x44_Dual_27x22 Mult, which performs either one 54x44
multiplication (for DP) or two 27x22 multiplications (for both
SPs), using the Karatsuba method, as in eq. (6). Here, $a_2$ is
44 bits wide and contains data as shown in Fig. 4.
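$$
\begin{aligned}
\mathrm{sp1\_o}[48:0] &= a_1^{-1}[26:0] * a_2[21:0] \\
\mathrm{sp2\_o}[48:0] &= a_1^{-1}[53:27] * a_2[43:22] \\
\mathrm{tmp}[55:0] &= \left( (\{a_1^{-1}[53:27], \text{5'b0}\} + a_1^{-1}[26:0]) * (a_2[43:22] + a_2[21:0]) \right) - (\{\mathrm{sp2\_o}, \text{5'b0}\} + \mathrm{sp1\_o}) \\
\mathrm{dp\_o}[97:0] &= \{\mathrm{sp2\_o}, \mathrm{sp1\_o}\} + \{\mathrm{tmp}, \text{22'h0}\} \\
a_1^{-1} a_2 &= \mathrm{dp\_sp} \;?\; \mathrm{dp\_o} : \{\mathrm{sp2\_o}, \mathrm{sp1\_o}\}
\end{aligned} \qquad (6)
$$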
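Step-3 computes $a_1^{-2} a_2^{2}$ with a dual mode squarer,
54x54_Dual_27x27 Square. This is done on a block basis with a
block size of 27 bits, and needs three 27x27 multipliers and one
90-bit adder, as follows:

$$
\begin{aligned}
\mathrm{operand} &\leftarrow i[53:0] \\
\mathrm{sp1\_o} &= i[26:0] * i[26:0] \\
\mathrm{sp2\_o} &= i[53:27] * i[53:27] \\
\mathrm{tmp} &= i[53:27] * i[26:0] \\
\mathrm{dp\_o} &= \{\mathrm{sp2\_o}, \mathrm{sp1\_o}\} + \{\mathrm{tmp}, \text{28'h0}\} \\
a_1^{-2} a_2^{2} &= \mathrm{dp\_sp} \;?\; \mathrm{dp\_o} : \{\mathrm{sp2\_o}, \mathrm{sp1\_o}\}
\end{aligned} \qquad (7)
$$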
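The shift by 28 bits in eq. (7), rather than by the 27-bit block size, accounts for the factor of two in the cross term of a square. A minimal integer model, with the hypothetical name `dual_square_54`, makes this visible:

```python
BLOCK = 27
BLOCK_MASK = (1 << BLOCK) - 1

def dual_square_54(i, dp_sp):
    """Model of eq. (7): one 54-bit square or two 27-bit squares."""
    hi, lo = i >> BLOCK, i & BLOCK_MASK
    sp1_o = lo * lo                           # SP-1 square (54-bit)
    sp2_o = hi * hi                           # SP-2 square (54-bit)
    tmp = hi * lo                             # single cross product
    packed = (sp2_o << 54) | sp1_o            # {sp2_o, sp1_o}
    # (hi*2^27 + lo)^2 = hi^2*2^54 + 2*hi*lo*2^27 + lo^2, and
    # 2*hi*lo*2^27 equals tmp shifted left by 28 bits.
    dp_o = packed + (tmp << (BLOCK + 1))
    return dp_o if dp_sp else packed
```

Because both cross products of a square are identical, three block multiplications suffice here without the Karatsuba additions used for the general multiplier.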
Step-4 needs a 34-bit squarer to compute $a_1^{-4} a_2^{4}$. This
is needed only for the DP flow and is implemented on a block
basis (with a block size of 17 bits), similar to eq. (7). Since
the contribution of this term to the DP result falls after the
34-bit precision, a smaller multiplier suffices. This step also
computes $a_1^{-1} a_2 - a_1^{-2} a_2^{2}$, using two 27-bit
subtractors (for both SPs), combined to form a 54-bit subtractor
(for DP).
Step-5 and step-6 perform computations related to DP only.
Step-5 computes the last sum term
$(1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4})$ of eq. (3), using a
54-bit adder, whereas step-6 multiplies this sum ($\alpha$) with
the DP output of $Z$ (from step-4), using a 54x54 multiplier based
on the Karatsuba method.
Step-7 performs the dual multiplication using a
54x54_Dual_27x27 multiplier (as in Fig. 5). It multiplies
$x a_1^{-1}$ with either $\beta$ of step-6 or the single precision
outputs of $Z$ (from step-4), to generate $W$
($x a_1^{-1} \{ (a_1^{-1} a_2 - a_1^{-2} a_2^{2}) \times (1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4}) \}$
for DP, or $x a_1^{-1} (a_1^{-1} a_2 - a_1^{-2} a_2^{2})$ for both
SPs). Finally, in step-8, $W$ is subtracted from $x a_1^{-1}$ using
two 28-bit subtractors (for both SPs), which collectively perform a
56-bit subtraction for the DP mantissa quotient.
Thus, the proposed dual mode mantissa division architecture
performs either one double precision division or two single
precision divisions, at the extra cost of a 256×24 LUT for SP-1
and multiplexers in each dual mode step.
E. Dynamic Right Shifting
The architecture of the dual mode dynamic right shifter is
shown in Fig. 6. The inputs to this unit are the mantissa quotient
and the computed right shift amount. The underlying concept of
its architecture is similar to that of the dual mode dynamic left
shifter. Compared to the left shifter, the additional multiplexer
in stage-1 to stage-5 is used to process the lower right shift
output or its combination with the primary input of the stage.
F. Normalization, Rounding and Final Processing
This section processes the previously computed exponents
and the dual mantissa division result to obtain the rounded,
normalized format. The output of the dual mode mantissa division
consists either of the DP mantissa quotient or of both SP mantissa
quotients, one in each of its 32-bit parts. Based on the MSBs of
the quotient result, the rounding position is determined. Then,
based on the rounding position bit, Guard-bit, Round-bit,
Sticky-bit and MSB-bit, the rounding ULP (unit in the last place)
is computed. This ULP computation is performed separately for DP
and for both SPs, and requires a few gates for each. The rounding
ULPs are then added to the mantissa quotient using two 28-bit
adders, which work individually for the SP computations and
collectively produce the output for DP. The rounded mantissa sum is
then normalized. The rounding adder is, in effect, similar to that
required for DP-only processing.
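The ULP decision can be illustrated with a small software model. The sketch below assumes round-to-nearest with ties resolved to even, which the text does not state explicitly, and it takes the rounding position as a given number of extra low-order bits; the names `round_nearest_even` and `round_and_normalize` are hypothetical.

```python
def round_nearest_even(m, extra_bits):
    """Drop `extra_bits` (>= 2) low bits of m, rounding to nearest (ties to even)."""
    lsb = (m >> extra_bits) & 1                    # rounding position bit
    guard = (m >> (extra_bits - 1)) & 1
    round_bit = (m >> (extra_bits - 2)) & 1
    sticky = int((m & ((1 << (extra_bits - 2)) - 1)) != 0)
    ulp = guard & (lsb | round_bit | sticky)       # a few gates in hardware
    return (m >> extra_bits) + ulp

def round_and_normalize(m, extra_bits, width, exp):
    """Round, then shift right by one bit if the mantissa overflows `width` bits."""
    rounded = round_nearest_even(m, extra_bits)
    if rounded >> width:                           # mantissa overflow
        rounded >>= 1
        exp += 1                                   # exponent update
    return rounded, exp
```

For instance, with two extra bits the value 0b1010 is an exact tie and keeps the even mantissa 0b10, whereas 0b1110 is rounded up to 0b100.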
Furthermore, the exponents are updated accordingly. Then each
exponent and mantissa is updated for one of the infinity,
sub-normal or underflow cases, and each requires separate units.
[Figure 6: block diagram. The input in[63:0] holds either one 64-bit word or two 32-bit words {[31:0],[31:0]}; the shift amounts are dp[5:0], sp2[4:0] and sp1[4:0]. An initial stage selects between in and in >> 32 using dp[5]. Stage-1 to Stage-5 are functions of (dp[4], sp2[4], sp1[4]) down to (dp[0], sp2[0], sp1[0]). One stage unit for shift bit x, with y = 2**x: the [63:32] block is shifted >> y under dp[x] | sp2[x], the [31:0] block is shifted >> y under dp[x] | sp1[x], and a multiplexer controlled by dp_sp & dp[x] combines bits [31+y:32] with bits [31-y:0]. The result is the Shifted Output.]
Fig. 6: Dual Mode Dynamic Right Shifter
TABLE I: Resource Overhead in DPdSP over DP only
DPdSP Sub-Components                              | Extra resource over DP only
Data Extraction, Sub-normal & Exceptional Handler | For one SP computation
LOD                                               | Nil
Dynamic Left/Right Shift                          | ≈ Nil
Mantissa Division                                 | A 256x24 LUT, and seven 54-bit 2:1 MUXs
Normalization & Rounding                          | ULP computation of both SPs
Final Processing                                  | Processing of both SPs & a 64-bit 2:1 MUX
The computed signs, exponents and mantissas of double pre-
cision and both single precision have been finally multiplexed
to produce the final 64-bit output, which either contains a DP
output or two SP outputs. A brief summary of extra resource
overhead of proposed DPdSP division architecture over only
a DP division is shown in Table-I.
IV. IMPLEMENTATION RESULTS
This section presents the implementation details of the proposed
DPdSP divider architecture along with a DP-only divider
implementation. The proposed DPdSP architecture has been
synthesized with the “OSUcells Cell Library” [15] in 0.18µm
technology, using Synopsys Design Compiler. The proposed
architecture is currently aimed at a single cycle design. The
DPdSP divider architecture has been synthesized both with only
normal support and with sub-normal support. A DP-only division
design, with the same state-of-the-art data path flow, has also
been synthesized (both with only normal support and with
sub-normal support) to compute the area & delay overhead.
The implementation details are shown in Table II. The DPdSP
design with normal support requires roughly 150K gates (based on
minimum inverter size), and ≈ 163K gates when sub-normal support
is included. With only normal support, the proposed DPdSP division
architecture requires ≈ 4.3% more hardware and ≈ 5% more delay
than the DP-only division design, and it needs ≈ 6.9% more
resources and ≈ 3.6% more delay when sub-normal support is
included.

TABLE II: ASIC Implementation Details
            | Only Normal        | With Sub-normal
            | DPdSP   | DP       | DPdSP   | DP
Area (µm²)  | 2395728 | 2297697  | 2611098 | 2441050
Gate Count  | 149733  | 143606   | 163194  | 152566
Delay (ns)  | 35.7    | 34       | 39.38   | 38
Delay (FO4) | 396.6   | 377.7    | 437.5   | 422.2

TABLE III: Comparison of DPdSP Division Architecture
                 | [7] (Only Normal) | Proposed (Only Normal) | Proposed (With Sub-normal)
Area OH (1)      | 22%               | 4.3%                   | 6.9%
Period OH (1)    | 8%                | 5%                     | 3.6%
Gate Count (2)   | 212854            | 149733                 | 163194
Period (FO4) (3) | 31.4              | 396.6                  | 437.5
Throughput (4)   | 29/15 (DP/dSP)    | 1/1 (DP/dSP)           | 1/1 (DP/dSP)
Throughput (5)   | 909.4             | 396.6                  | 437.5
Area × Delay (6) | 1.93×10^8         | 0.59×10^8              | 0.71×10^8
(1) Area/Period OH = (DPdSP - DP) / DP
(2) Based on minimum size inverter
(3) 1 FO4 (ns) ≈ (Tech. in µm) / 2
(4) Throughput (in cycles)
(5) Throughput (in FO4) = Throughput (in cycles) × Period (FO4)
(6) Gate Count × Throughput (FO4)
A comparison with the previously reported dual-mode (double
precision with two-parallel single precision support) division
design in the literature is shown in Table III. To the best of our
knowledge, there is no prior work on this topic using
multiplicative methods of division. As noted earlier, Isseven
et al. [7] proposed an iterative dual-mode division architecture
using the radix-4 SRT division algorithm, a digit-recurrence
method. Their design has a throughput of 29 clock cycles for
double precision and 15 clock cycles for single precision.
Further, it supports only normal operands, without sub-normal
support. They synthesized their design using TSMC 0.25µm
technology. Table III compares the proposed work with theirs. We
have made a technology independent comparison, where area is
compared in terms of gate count (based on minimum size inverter)
and delay/throughput is compared in terms of FO4 (fan-out-of-4)
delay. Comparatively, Isseven et al.'s dual-mode architecture
needs a larger area than the proposed DPdSP architecture. The
throughput for double precision processing is much better for the
proposed design (396.6 FO4 versus 909.4 FO4). For single
precision, the throughput of Isseven et al. is 15 clock cycles,
equivalent to 471 FO4, which is also larger than that of the
proposed work. The area×delay of the proposed architecture is much
better than that of the previous work. The proposed architecture
can be easily pipelined to achieve even better throughput. Also,
the area & delay overhead of the proposed work over a DP-only
divider is smaller than that of Isseven et al. In addition to the
better design metrics, the proposed work also supports the
processing of sub-normal operands.
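The derived metrics in Tables II and III follow from the table footnotes and can be reproduced with a few lines of arithmetic. The sketch below only restates those footnote formulas; the function names are hypothetical.

```python
def overhead(dual, single):
    """Area/Period OH = (DPdSP - DP) / DP."""
    return (dual - single) / single

def to_fo4(delay_ns, tech_um):
    """Convert a delay in ns to FO4 units, with 1 FO4 (ns) ~ tech (um) / 2."""
    return delay_ns / (tech_um / 2)

def area_delay(gate_count, throughput_cycles, period_fo4):
    """Gate count x throughput (FO4), with throughput = cycles x period."""
    return gate_count * throughput_cycles * period_fo4

# Proposed design, only normal support, 0.18 um technology (Table II).
area_oh = overhead(2395728, 2297697)        # about 0.043
delay_oh = overhead(35.7, 34)               # about 0.05
period_fo4 = to_fo4(35.7, 0.18)             # about 396.7 FO4
metric = area_delay(149733, 1, 396.6)       # about 0.59e8
```

The same functions applied to the sub-normal columns give an area overhead of about 7%, a delay overhead of about 3.6%, and an area×delay of about 0.71×10^8.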
V. CONCLUSIONS
This paper has presented an architecture for floating point
division with on-the-fly dual precision support. It supports
configurable double precision with dual (two-parallel) single
precision (DPdSP) floating point division. It supports the
processing of normal and sub-normal operands. The data path of the
architecture has been tuned to perform the dual mode computation
with minimal hardware overhead. The crucial mantissa division
module has been tuned together with the other components (dual
mode LOD, dual mode dynamic left/right shifter, rounding) for
on-the-fly dual mode computation. It has ≈ 4.3%−6.9% area and
≈ 5% delay overhead over a DP-only module. The proposed division
architecture is the only known multiplicative-based, dynamically
configurable design with on-the-fly multi-precision support.
Compared to previous work in the literature, this work performs
better in terms of required area, throughput, and area×delay; has
smaller area & delay overhead over a DP-only divider; and offers
more computational support. Future work will focus on configurable
architectures for other division methods (Newton-Raphson,
Goldschmidt).
ACKNOWLEDGMENT
This work was supported by a grant from Croucher Startup
Allowance (Project No. 9500015).
REFERENCES
[1] A. Baluni, F. Merchant, S. K. Nandy, and S. Balakrishnan, “A fully
pipelined modular multiple precision floating point multiplier with
vector support,” in Electronic System Design (ISED), 2011 International
Symposium on, 2011, pp. 45–50.
[2] K. Manolopoulos, D. Reisis, and V. Chouliaras, “An efficient multiple
precision floating-point multiplier,” in Electronics, Circuits and Systems,
2011 18th IEEE International Conference on, 2011, pp. 153–156.
[3] A. Akkaş and M. J. Schulte, “Dual-mode floating-point multiplier
architectures with parallel operations,” Journal of Systems Architecture,
vol. 52, no. 10, pp. 549 – 562, 2006.
[4] A. Akkas, “Dual-Mode Quadruple Precision Floating-Point Adder,”
Digital Systems Design, Euromicro Symposium on, pp. 211–220, 2006.
[5] ——, “Dual-mode floating-point adder architectures,” Journal of Sys-
tems Architecture, vol. 54, no. 12, pp. 1129–1142, Dec. 2008.
[6] M. Ozbilen and M. Gok, “A multi-precision floating-point adder,” in
Research in Microelectronics and Electronics, 2008. PRIME 2008.
Ph.D., 2008, pp. 117–120.
[7] A. Isseven and A. Akkas, “A dual-mode quadruple precision floating-
point divider,” in Signals, Systems and Computers, 2006. ACSSC ’06.
Fortieth Asilomar Conference on, 2006, pp. 1697–1701.
[8] S. F. Obermann and M. J. Flynn, “Division algorithms and implementa-
tions,” Computers, IEEE Transactions on, vol. 46, no. 8, pp. 833–854,
Aug. 1997.
[9] X. Wang and M. Leeser, “Vfloat: A variable precision fixed- and
floating-point library for reconfigurable hardware,” ACM Trans. Recon-
figurable Technol. Syst., vol. 3, no. 3, pp. 16:1–16:34, Sep. 2010.
[10] J.-C. Jeong, W.-C. Park, W. Jeong, T.-D. Han, and M.-K. Lee, “A cost-
effective pipelined divider with a small lookup table,” Computers, IEEE
Transactions on, vol. 53, no. 4, pp. 489–495, 2004.
[11] M. K. Jaiswal and R. C. C. Cheung, “High Performance Reconfigurable
Architecture for Double Precision Floating Point Division,” in 8th
International Symposium on Applied Reconfigurable Computing (ARC-
2012). Hong Kong: Springer LNCS, March 2012, pp. 302–313.
[12] “IEEE Standard for Binary Floating-Point Arithmetic,” ANSI/IEEE Std
754-1985, 1985.
[13] “IEEE Standard for Floating-Point Arithmetic,” Tech. Rep., Aug. 2008.
[14] A. Karatsuba and Y. Ofman, “Multiplication of Many-Digital Numbers
by Automatic Computers,” in Proceedings of the USSR Academy of
Sciences, vol. 145, 1962, pp. 293–294.
[15] Oklahoma State University, OSUCells, http://vlsiarch.ecen.okstate.edu.
