Dual-mode double precision division architecture by Jaiswal, MK & So, HKH
Title Dual-mode double precision division architecture
Author(s) Jaiswal, MK; So, HKH
Citation
Proceedings of 2016 IEEE 59th International Midwest
Symposium on Circuits and Systems (MWSCAS), Abu Dhabi,
United Arab Emirates, 16-19 October 2016, p. 1-4
Issued Date 2016
URL http://hdl.handle.net/10722/247792
Rights
International Midwest Symposium on Circuits and Systems
Conference Proceedings. Copyright © IEEE.; ©2016 IEEE.
Personal use of this material is permitted. Permission from IEEE
must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for
advertising or promotional purposes, creating new collective
works, for resale or redistribution to servers or lists, or reuse of
any copyrighted component of this work in other works.; This
work is licensed under a Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License
Dual-Mode Double Precision Division Architecture
Manish Kumar Jaiswal¹ and Hayden K.-H So²
Dept. of EEE, The University of Hong Kong, Hong Kong
Email: ¹manishkj@eee.hku.hk, ²hso@eee.hku.hk
Abstract—This paper presents an area-efficient architecture for
dual-mode double precision floating point division, which can
process either one double precision (DP) division or two parallel
single precision (SP) divisions. The dual-mode mantissa division
architecture is based on the series expansion methodology and is
implemented iteratively. A dual-mode Radix-4 Modified Booth
multiplier is designed for this purpose and is used iteratively in
the dual-mode mantissa computation. The proposed dual-mode
division architecture is synthesized as an ASIC using UMC 90nm
technology. The proposed architecture shows better design metrics
in terms of required area, clock period and throughput than prior
work.
Index Terms—Floating Point Division, Dual-Mode Architec-
ture, ASIC, Configurable Architecture.
I. INTRODUCTION
Floating point arithmetic (FPA) is a basic ingredient of a large
set of scientific and engineering applications. To boost application
performance, arrays of single precision and double precision
computing units are used for floating point vector processing.
Current research aims at unified vector-processing units: instead
of separate vector arrays for single precision and double precision,
a processor can have one array of configurable floating point
arithmetic blocks, each of which can process either one double
precision or two parallel single precision computations. This
configurable block array can lead to a significant area improvement
while providing the required performance.
This paper focuses on the design of a dual-mode double precision
division arithmetic unit. Only a few works are available on
dual-mode double precision division architectures [1], [2].
Floating point (FP) division is a core computation required in a
multitude of applications. The proposed architecture can be
configured for either one double precision or two parallel (dual)
single precision division computations, and is named the DPdSP
division architecture. It is based on the series expansion
methodology of division [3]. Series expansion is a multiplicative
division method, which provides a hardware-efficient architecture
for a given precision requirement [4]. A dual-mode Radix-4 Modified
Booth multiplier is designed for the mantissa division. Since the
underlying integer multiplier accounts for most of the area of the
mantissa division unit, an iterative architecture is proposed to
achieve area efficiency. The present work builds upon [2] with a
practical approach to the DPdSP division architecture. Whereas [2]
presented an impractical single-cycle implementation with a large
area requirement, the present work provides a different
architecture (with some novel architectural proposals) for the most
complex unit, the mantissa division, which constitutes more than
80% of the hardware resources of the FP division architecture.
The main contributions of this work can be briefly summa-
rized as follows:
• Proposed dual-mode DPdSP division architectures with
sub-normal computational support, which can be config-
ured either for a double precision division or two parallel
single precision divisions.
• A novel dual-mode Radix-4 Modified Booth multiplier
architecture is proposed, which is the main constituent of
the proposed dual-mode mantissa division architecture.
II. UNDERLYING MANTISSA DIVISION METHOD
The algorithmic methodology for mantissa division archi-
tecture is discussed here. It is based on the series expansion
method of division, as follows.
Let m1 be the normalized dividend mantissa and m2 be the
normalized divisor mantissa, then q, the mantissa quotient, can
be computed as:
$$q = \frac{m_1}{m_2} = \frac{m_1}{a_1+a_2} = m_1(a_1+a_2)^{-1} = m_1\left(a_1^{-1} - a_1^{-2}a_2 + a_1^{-3}a_2^{2} - a_1^{-4}a_2^{3} + \cdots\right) \tag{1}$$
where $a_1$ and $a_2$ are parts of the divisor mantissa, as below.
$$m_2 \rightarrow \underbrace{\overbrace{1.xxxxxxxx}^{a_1}}_{W\text{-bit}}\ \overbrace{xx\ldots\ldots xxxxxxx}^{a_2}$$
Here, the pre-computed value of $a_1^{-1}$ acts as an initial
approximation of $m_2^{-1}$, which is then improved by the remaining
computation in (1). The size W (bit width) of $a_1$ determines both
the size of the memory that stores $a_1^{-1}$ and the number of
series terms needed for a given precision. For a good balance
between W and the required number of terms, a bit width of W = 8 is
selected for $a_1$, which requires 7 terms (up to $a_1^{-7}a_2^{6}$)
for double precision and 3 terms (up to $a_1^{-3}a_2^{2}$) for
single precision. For the dual-mode architecture design, a unified
equation for double and single precision processing is formulated
as below.
$$q = \underbrace{\overbrace{m_1a_1^{-1} - m_1a_1^{-1}\left(a_1^{-1}a_2 - a_1^{-2}a_2^{2}\right)}^{SP}\left(1 + a_1^{-2}a_2^{2} + a_1^{-4}a_2^{4}\right)}_{DP} \tag{2}$$
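As a consistency check on (2), the expansion below uses the shorthand $B = a_1^{-1}a_2$ (the same abbreviation that appears in (3)). It is a sketch that assumes an exact value of $a_1^{-1}$ and ignores the finite-width truncation of intermediate terms in hardware.

```latex
\begin{align*}
\text{SP:}\quad & m_1a_1^{-1}\bigl[1-(B-B^{2})\bigr]
   = m_1a_1^{-1}\bigl(1-B+B^{2}\bigr) \\
\text{DP:}\quad & m_1a_1^{-1}\bigl[1-(B-B^{2})(1+B^{2}+B^{4})\bigr]
   = m_1a_1^{-1}\bigl(1-B+B^{2}-B^{3}+B^{4}-B^{5}+B^{6}\bigr) \\
\text{exact:}\quad & \frac{m_1}{a_1+a_2} = \frac{m_1a_1^{-1}}{1+B}, \qquad
   \frac{1}{1+B} - \sum_{k=0}^{n-1}(-B)^{k} = \frac{(-B)^{n}}{1+B}
\end{align*}
```

The SP part therefore reproduces the first 3 terms of (1) and the full expression the first 7. With W = 8, $a_1 \ge 1$ and $a_2 < 2^{-8}$, so $B < 2^{-8}$ and the relative truncation error $B^{n}$ stays below $2^{-24}$ for n = 3 and below $2^{-56}$ for n = 7, which is consistent with the term counts quoted above.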
[Fig. 1: DPdSP Division Architecture. Data path, from inputs to
output: in1 (dividend) and in2 (divisor); Data Extraction &
SubNormal Handler; 2:1 MUXs controlled by dp_sp that form M1 from
{dp_m1, 11'b0} or {sp2_m1, 8'b0, sp1_m1, 8'b0} and M2 from
{dp_m2, 11'b0} or {sp2_m2, 8'b0, sp1_m2, 8'b0}; LOD 64_Dual32 and
Dynamic Left Shift 64_Dual32 on M1 and M2, producing m1 and m2;
Dual Mantissa Division Architecture (div_M); Sign, Exp, &
R-Shift-Amount; Dynamic Right Shift 64_Dual32; Dual-Mode Rounding;
Normalization & Exceptional Handling; output MUX selecting dp_out
or {sp2_out, sp1_out} as the final 64-bit output.
Legend: dp: Double Precision, sp: Single Precision, _ls: Left-Shift,
_rs: Right-Shift, _s: Sign, _e: Exponent, _m: Mantissa,
dp_sp: Mode (DP/Dual-SP).]
[Fig. 2: DPdSP Input Output Format. Each 64-bit operand (in1, in2)
carries either one DP value (11-bit exponent, 52-bit mantissa) or
two SP values (8-bit exponent, 23-bit mantissa each): bits [63:32]
hold DP[63:32] or SP-2, and bits [31:0] hold DP[31:0] or SP-1.]
Here, the SP part computes the single precision quotient, while the
entire equation computes the double precision quotient. This
feature of (2) is the basis for sharing hardware resources in the
dual-mode mantissa division architecture, which can process either
one DP mantissa division or two SP mantissa divisions. The look-up
table that stores $a_1^{-1}$ has a size of $2^{8}\times 53$ for DP
and $2^{8}\times 24$ for SP, which is sufficient for both precisions.
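The behaviour of (2) can be checked numerically. The following is a minimal sketch in exact rational arithmetic, not the paper's hardware data path: the function names and the example mantissas are hypothetical, and $a_1^{-1}$ is taken as exact rather than as a finite-width look-up table entry.

```python
from fractions import Fraction

W = 8  # number of fraction bits kept in a1, as selected in the text


def split_divisor(m2: Fraction, w: int = W):
    """Split a normalized mantissa m2 in [1, 2) into a1 (leading 1 plus w fraction bits) and a2."""
    scaled = m2 * 2**w
    a1 = Fraction(scaled.numerator // scaled.denominator, 2**w)
    return a1, m2 - a1


def quotient_eq2(m1: Fraction, m2: Fraction, double_precision: bool) -> Fraction:
    """Evaluate (2): only the SP part, or the full DP expression."""
    a1, a2 = split_divisor(m2)
    inv_a1 = 1 / a1              # assumption: exact inverse instead of a LUT entry
    a = m1 * inv_a1              # m1 * a1^-1
    b = inv_a1 * a2              # a1^-1 * a2
    e = b - b * b                # a1^-1 a2 - a1^-2 a2^2
    if double_precision:
        f = 1 + b**2 + b**4      # 1 + a1^-2 a2^2 + a1^-4 a2^4
        return a - a * e * f
    return a - a * e


m1 = Fraction(3, 2)
m2 = Fraction(0x1ABCDEF, 2**24)  # arbitrary example mantissa in [1, 2)
exact = m1 / m2
sp_err = abs(quotient_eq2(m1, m2, False) - exact) / exact
dp_err = abs(quotient_eq2(m1, m2, True) - exact) / exact
print(float(sp_err), float(dp_err))
```

For this example the relative error of the SP part is below $2^{-24}$ and that of the full expression below $2^{-56}$; a real implementation would additionally incur truncation error in each intermediate product.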
III. PROPOSED DPDSP DIVISION ARCHITECTURE
The proposed architecture is shown in Fig. 1. It is composed of
three pipelined stages. The primary inputs are two 64-bit operands,
the dividend (in1) and the divisor (in2), along with the
mode-control signal dp_sp (double precision or dual single
precision). Each input operand contains either one DP operand (the
entire 64 bits) or two parallel SP operands (two 32-bit halves), as
shown in Fig. 2.
A. First-Stage Architecture
The first stage performs data extraction, exceptional case
handling, and sub-normal processing. It also includes one part of
the mantissa division unit: the pre-fetching of the initial
approximation of the divisor mantissa inverse from the look-up
table.
[Fig. 3: DPdSP Dual Mode Mantissa Division Architecture. First
stage: the DP_SP2 LUT, addressed by dp_m2[51:44] or sp2_m2[22:15]
(selected by dp_sp), outputs dp_sp2_m2_a1^{-1}; the SP1 LUT,
addressed by sp1_m2[22:15], outputs sp1_m2_a1^{-1}. Second stage:
the 54x54_dual24x24 Modified Booth Multiplier with inputs in1_t1,
in1_t2 and in2, the STATE logic and REGISTERS, and the outputs
mult_dp, mult_sp1 and mult_sp2.]
The data extraction takes the primary operands and extracts the
sign, exponent and mantissa components for double precision and
for both single precision operands, based on their standard
formats. The sub-normal (_sn) handling and exceptional checks are
done using traditional methods. However, as the 8 MSBs of the DP
exponent overlap with the SP-2 exponent (as shown in Fig. 2), the
checks for sub-normal, infinity and NaN (Not-a-Number) are shared
between SP-2 and DP. Similarly, the checks for divide-by-zero
(_dbz) and zero (_z) are shared among DP and both SPs.
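The field extraction and the exponent overlap described above can be illustrated with a short sketch. It assumes the standard IEEE 754 binary64 and binary32 layouts; the helper names are hypothetical and the sharing of the check logic itself is not modelled.

```python
import struct


def extract_dp(word: int):
    """Return (sign, exponent, fraction) of a 64-bit double precision word."""
    return (word >> 63) & 1, (word >> 52) & 0x7FF, word & ((1 << 52) - 1)


def extract_sp(word: int):
    """Return (sign, exponent, fraction) of a 32-bit single precision word."""
    return (word >> 31) & 1, (word >> 23) & 0xFF, word & ((1 << 23) - 1)


def extract_dual_sp(word: int):
    """Return the fields of SP-2 (bits 63:32) and SP-1 (bits 31:0)."""
    return extract_sp(word >> 32), extract_sp(word & 0xFFFFFFFF)


def classify(exponent: int, fraction: int, exp_bits: int) -> str:
    """Classify an operand from its biased exponent and fraction fields."""
    all_ones = (1 << exp_bits) - 1
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "nan"
    return "normal"


# The same 64-bit word viewed in both modes: the SP-2 exponent field
# coincides with the 8 MSBs of the DP exponent field.
dp_word = struct.unpack(">Q", struct.pack(">d", 1.5))[0]
_, dp_exp, _ = extract_dp(dp_word)
(_, sp2_exp, _), _ = extract_dual_sp(dp_word)
print(hex(dp_word), dp_exp >> 3, sp2_exp)
```

Because both exponent fields start at bit 62, an all-zero or all-one test on the upper 8 exponent bits can serve both modes, with DP needing only the 3 extra low exponent bits.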
After the above processing, a unified set of mantissas (M1 and M2)
is generated using two 2:1 MUXs (as shown in Fig. 1); they contain
the mantissas either for DP or for both SPs. This unification helps
in designing a tuned data path for the later stages, which results
in efficient resource sharing. The next two units in this stage,
the leading-one-detector (LOD) and the dynamic left shifter,
perform the sub-normal processing: they bring a sub-normal mantissa
(if any) into the normalized format. The details of the dual-mode
LOD and dual-mode dynamic left shifter architectures can be found
in [2].
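A behavioural sketch of the dual-mode normalization follows. It is not the architecture of [2]: the function names are hypothetical, and the unified word is assumed to be laid out as in Fig. 1, that is, {dp_m, 11'b0} in DP mode and {sp2_m, 8'b0, sp1_m, 8'b0} in dual-SP mode.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def leading_zero_count(value: int, width: int) -> int:
    """Number of zero bits above the leading one of a width-bit value."""
    return width - value.bit_length()


def normalize_left(M: int, dp_sp: bool):
    """Shift each mantissa so its leading one reaches the MSB of its lane.

    Returns the shifted word and the left-shift amount(s):
    (dp_ls,) in DP mode, (sp2_ls, sp1_ls) in dual-SP mode.
    """
    if dp_sp:
        shift = leading_zero_count(M, 64) if M else 0
        return (M << shift) & MASK64, (shift,)
    hi, lo = M >> 32, M & MASK32
    sh_hi = leading_zero_count(hi, 32) if hi else 0
    sh_lo = leading_zero_count(lo, 32) if lo else 0
    out = (((hi << sh_hi) & MASK32) << 32) | ((lo << sh_lo) & MASK32)
    return out, (sh_hi, sh_lo)


print(normalize_left(1 << 60, True))
print(normalize_left((1 << 34) | 1, False))
```

The shift amounts are returned because the exponent logic needs them to compensate for the normalization of sub-normal operands.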
The above processing produces the normalized mantissas m1 and m2,
as shown in Fig. 1. Further, in this stage, the 8-bit MSB part
($a_1$) of each normalized divisor mantissa (m2) is used to fetch
the pre-computed initial approximation of its inverse. As shown in
the first-stage part of Fig. 3, the DP_SP2 LUT (256x53) is shared
by the DP and SP-2 initial approximations, and the SP1 LUT (256x24)
serves SP-1 only.
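One way such tables could be generated and addressed is sketched below. The fixed-point format of the entries (truncated values scaled by 2^(width-1)) is an assumption, since the stored format is not specified here, and the function names are hypothetical; only the table dimensions and the address bit ranges come from the text and Fig. 3.

```python
def build_inverse_lut(width: int, w: int = 8):
    """Table with 2**w entries; entry i approximates 1/a1 for a1 = 1 + i / 2**w.

    Assumed format: truncated integer scaled by 2**(width - 1).
    """
    scale = 1 << (width - 1)
    size = 1 << w
    return [(scale * size) // (size + i) for i in range(size)]


def dp_lut_address(dp_fraction: int) -> int:
    """Address from dp_m2[51:44], the top 8 fraction bits of the DP divisor."""
    return (dp_fraction >> 44) & 0xFF


def sp_lut_address(sp_fraction: int) -> int:
    """Address from sp_m2[22:15], the top 8 fraction bits of an SP divisor."""
    return (sp_fraction >> 15) & 0xFF


DP_SP2_LUT = build_inverse_lut(53)   # 256 x 53, shared by DP and SP-2
SP1_LUT = build_inverse_lut(24)      # 256 x 24, SP-1 only

# In dual-SP mode, SP-2 uses the upper 24 bits ([52:29]) of the shared entry.
addr = sp_lut_address(0x400000)      # fraction 0.5, i.e. a1 = 1.5
print(addr, DP_SP2_LUT[addr] >> 29, SP1_LUT[addr])
```

With truncated entries, the upper 24 bits of each 53-bit entry coincide with the 24-bit entry, which is one reason a shared DP/SP-2 table is plausible; other rounding choices would not give this exact equality.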
B. Second-Stage Architecture
This stage performs the sign, exponent and mantissa processing of
the FP division, together with the computation of the right shift
amount. The exponent and right shift amount computations are done
using traditional methods and are processed separately for DP and
both SPs.
The dual-mode mantissa division is the most crucial component of
the FP division architecture. The mantissa computation architecture
is the unified, dual-mode implementation of (2). It is built around
a dual-mode Booth multiplier that is used iteratively. A dual-mode
finite state machine (FSM) is designed, which decides the effective
inputs of the multiplier in each state.
[Fig. 4: Dual-Mode Modified Booth Multiplier Architecture.
Multiplicand (in1) and multiplier (in2) inputs:
  in1_t1[53:0] <- dp_sp ? dp_in1 : {30'b0, sp1_in1}
  in1_t2[53:0] <- dp_sp ? dp_in1 : {sp2_in1, 30'b0}
  in2[53:0]    <- dp_sp ? dp_in2 : {sp2_in2, 6'b0, sp1_in2}
Radix-4 Modified Booth encoding generates 14 partial products PP1
from in1_t1 and 14 partial products PP2 from in1_t2; a DADDA tree
(8 levels) and a Kogge-Stone adder produce dp_mult[107:0]. In
dual-SP mode, sp1_mult[47:0] occupies bits [47:0] and
sp2_mult[47:0] occupies bits [107:60], separated by 12 bits.]
1) Dual-Mode Radix-4 Modified Booth Multiplier Architecture: The
architecture is based on Radix-4 Modified Booth encoding and is
shown in Fig. 4. It is a 54-bit integer multiplier (for DP
processing) that can also perform two parallel 24-bit unsigned
multiplications (for the two SPs). The dual-mode multiplier has
three input operands: two multiplicands and one multiplier. The
inputs in1_t1 and in1_t2 form the multiplicand operands. Here,
in1_t1 holds either the DP multiplicand or the SP-1 multiplicand at
the LSB side, and in1_t2 holds either the DP multiplicand or the
SP-2 multiplicand at the MSB side. The multiplier input (in2) holds
the multiplier operand either for DP or for both SPs with 6 zero
bits in between (see the top portion of Fig. 4). Correspondingly,
two sets of partial products (PP1 and PP2) are generated: PP1
results from in1_t1 and in2, and PP2 from in1_t2 and in2. The
inputs in1_t1, in1_t2 and in2 are built so that, in dual-SP mode,
the single precision partial products (PP1-SP1 and PP2-SP2) and
their reduction do not overlap (Fig. 4) and produce two distinct
results for the SP-1 and SP-2 multiplications, respectively.
Therefore, the sum of all partial products is the product of the DP
operands in DP mode, or the products of both SP operand pairs in
dual-SP mode. A DADDA tree of 8 levels compresses all partial
products into two operands, which are then added using a
parallel-prefix Kogge-Stone final adder. The final product contains
either the DP or the dual-SP results, as shown in Fig. 4. Compared
to a conventional single-mode Modified Booth multiplier, the
proposed dual-mode Modified Booth multiplier requires only three
2:1 MUXs as area overhead, which are needed for multiplexing the
input operands.
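The non-overlap of the two SP products can be seen in a small behavioural model. This sketch replaces the Booth encoding, the DADDA tree and the Kogge-Stone adder with plain integer arithmetic. The boundary SPLIT = 28 between the multiplier bits that act on in1_t1 and those that act on in1_t2 is an assumption derived from the 14 + 14 radix-4 partial products in Fig. 4, and the function names are hypothetical.

```python
SPLIT = 28  # assumed: 14 radix-4 digits (28 multiplier bits) per partial-product group


def dual_mode_multiply(dp_sp, dp_in1=0, dp_in2=0,
                       sp1_in1=0, sp1_in2=0, sp2_in1=0, sp2_in2=0):
    """Behavioural model of the 54x54 / dual 24x24 multiplier (108-bit result)."""
    if dp_sp:
        in1_t1 = in1_t2 = dp_in1
        in2 = dp_in2
    else:
        in1_t1 = sp1_in1                         # {30'b0, sp1_in1}
        in1_t2 = sp2_in1 << 30                   # {sp2_in1, 30'b0}
        in2 = (sp2_in2 << 30) | sp1_in2          # {sp2_in2, 6'b0, sp1_in2}
    pp1 = in1_t1 * (in2 & ((1 << SPLIT) - 1))    # lower multiplier bits with in1_t1
    pp2 = (in1_t2 * (in2 >> SPLIT)) << SPLIT     # upper multiplier bits with in1_t2
    return pp1 + pp2


def split_sp_products(product: int):
    """Extract sp1_mult (bits [47:0]) and sp2_mult (bits [107:60])."""
    mask48 = (1 << 48) - 1
    return product & mask48, (product >> 60) & mask48


a, b = (1 << 53) | 12345, (1 << 53) | 67891
print(dual_mode_multiply(True, dp_in1=a, dp_in2=b) == a * b)
p = dual_mode_multiply(False, sp1_in1=0xFFFFFF, sp1_in2=0xFFFFFF,
                       sp2_in1=0x800001, sp2_in2=0xABCDEF)
print(split_sp_products(p) == (0xFFFFFF * 0xFFFFFF, 0x800001 * 0xABCDEF))
```

In DP mode both multiplicand inputs carry the same value, so the two groups sum to the ordinary product; in dual-SP mode the zero padding keeps the SP-1 product in bits [47:0] and shifts the SP-2 product to bits [107:60].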
2) Dual-Mode Iterative Mantissa Division Architecture: The mantissa
division is designed iteratively to obtain an area-efficient
architecture. The architecture is the unified implementation of
(2), which can process either one DP mantissa division or two
parallel SP mantissa divisions by using the dual-mode Modified
Booth multiplier discussed above. Here, m1 (dividend) and m2
(divisor) are normalized mantissas that contain either the DP
mantissas (dp_m1[52:0] and dp_m2[52:0]) or both SP mantissas
(sp1_m1[23:0], sp1_m2[23:0], sp2_m1[23:0] and sp2_m2[23:0]), as
shown in Fig. 3. The divisor mantissa (m2) is partitioned into
$a_1$ (the first 8 bits to the right of the decimal point) and
$a_2$ (all remaining bits to the right of $a_1$), for DP and both
SPs, as below.
$$m_2 \rightarrow \overbrace{1.\underbrace{xxxxxxxx}_{8\text{ bit}}}^{a_1}\ \overbrace{\underbrace{xxxxxxx\ldots\ldots xxxxxxx}_{\text{DP: 44-bit, SP: 15-bit}}}^{a_2}$$
[Fig. 5: DPdSP Dual-Mode Iterative Mantissa Division FSM. States S0
to S8 in sequence, annotated with the terms A to I of (3) as they
are computed; a dp_sp = 0 transition is taken in dual-SP mode, and
done = 0 and done = 1 are marked at S0 and S8.]
For ease of understanding, the terms of (2) are listed in (3). With
these abbreviations, the SP computation only needs to skip the
computation of D, F, G and $H_{DP}$ from the DP flow. A 9-state (S0
to S8) FSM is designed for this purpose. Each state of the FSM
determines the inputs (in1_t1, in1_t2 and in2) of the dual-mode
Modified Booth multiplier and assigns its output to the designated
terms, as follows:
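$$\begin{aligned}
A &= m_1 a_1^{-1}, \quad B = a_1^{-1}a_2, \quad C = B^{2} = a_1^{-2}a_2^{2}, \quad D = B^{4} = C^{2} = a_1^{-4}a_2^{4} \\
E &= B - C = a_1^{-1}a_2 - a_1^{-2}a_2^{2}, \quad F = 1 + C + D = 1 + a_1^{-2}a_2^{2} + a_1^{-4}a_2^{4} \\
G &= EF, \quad H_{DP} = AG, \quad H_{SP} = AE, \quad I = A - H
\end{aligned} \tag{3}$$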
S0: in1_t1 = dp_sp ? {1'b0, dp_m1} : {30'b0, sp1_m1}
    in1_t2 = dp_sp ? {1'b0, dp_m1} : {sp2_m1, 30'b0}
    in2    = dp_sp ? {1'b0, dp_m2_a1^{-1}} : {sp2_m2_a1^{-1}[52:29], 6'b0, sp1_m2_a1^{-1}}
S1: in1_t1 = dp_sp ? {10'b0, dp_m2_a2} : {30'b0, 9'b0, sp1_m2_a2}
    in1_t2 = dp_sp ? {10'b0, dp_m2_a2} : {9'b0, sp2_m2_a2, 30'b0}
    in2    = dp_sp ? {1'b0, dp_m2_a1^{-1}} : {sp2_m2_a1^{-1}[52:29], 6'b0, sp1_m2_a1^{-1}}
    A[63:0] = dp_sp ? dp_mult[105:42] : {sp2_mult[47:16], sp1_mult[47:16]}
S2: in1_t1 = dp_sp ? dp_mult[96:43] : {30'b0, sp1_mult[38:15]}
    in1_t2 = dp_sp ? dp_mult[96:43] : {sp2_mult[38:15], 30'b0}
    in2    = dp_sp ? dp_mult[96:43] : {sp2_mult[38:15], 6'b0, sp1_mult[38:15]}
    B[63:0] = dp_sp ? dp_mult[96:43] : {sp2_mult[38:12], sp1_mult[38:12]}
S3: in1_t1 = in1_t2 = in2 = dp_mult[107:54]
    C_DP = dp_mult
    C = dp_sp ? {8'b0, dp_mult[107:62]} : {8'b0, sp2_mult[47:29], 8'b0, sp1_mult[47:29]}
    E = B - C
S4: in1_t1 = in1_t2 = in2 = 0
    D_DP = dp_mult[107:87]
    F_DP[53:0] = {1'b1, 16'b0, C_DP[107:71]} + {33'b0, D_DP}
S5: in1_t1 = in1_t2 = E
    in2 = F_DP
S6: G = dp_mult[107:54]
    in1_t1 = dp_sp ? G : {30'b0, E[26:3]}
    in1_t2 = dp_sp ? G : {E[26:3], 30'b0}
    in2 = A
S7: in1_t1 = in1_t2 = in2 = 0
    AE = {8'b0, sp2_mult[47:24], 8'b0, sp1_mult[47:24]}
    AG = {7'b0, dp_mult[107:51]}
    H = dp_sp ? AG : AE
S8: I = A - H
    in1_t1 = in1_t2 = in2 = 0                                        (4)
The finite state machine (FSM) is shown in Fig. 5. For DP
processing it goes through all the states, whereas for dual-SP it
skips states S4 and S5, which perform only DP-related computations.
The selection of bits for a term is based on the position of the
decimal point and the mode of processing. Generally, in DP mode the
multiplications are done in 54 bits (sufficient for its precision
requirement) and additions/subtractions in 64 bits (to preserve
precision), whereas in dual-SP mode the multiplications are done in
24 bits and additions/subtractions in 32 bits. The mantissa
division requires 9 cycles in DP mode and only 7 cycles in dual-SP
mode. Compared to a DP-only mantissa division FSM, the DPdSP
mantissa division FSM requires fourteen 54-bit 2:1 MUXs as
overhead.
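The state schedule can be summarized in a few lines. The sketch below only models which states are visited and which product each state launches, as read off the in1/in2 assignments in (4); the names are hypothetical. The one-cycle first and third stages are an assumption, chosen because it reproduces the latencies of 11 and 9 cycles reported in Section IV.

```python
ALL_STATES = [f"S{i}" for i in range(9)]   # S0 .. S8
DP_ONLY_STATES = {"S4", "S5"}              # skipped in dual-SP mode, per the text

# Multiplication launched in each state (its result is captured in a later state).
LAUNCHED_PRODUCT = {
    "S0": "A = m1 * a1^-1",
    "S1": "B = a1^-1 * a2",
    "S2": "C = B * B",
    "S3": "D = C * C (result used in DP mode only)",
    "S5": "G = E * F",
    "S6": "H = A * G (DP) or A * E (dual-SP)",
}


def state_sequence(dp_sp: bool):
    """States visited for one mantissa division in the given mode."""
    return [s for s in ALL_STATES if dp_sp or s not in DP_ONLY_STATES]


def latency_cycles(dp_sp: bool, first_stage: int = 1, third_stage: int = 1) -> int:
    """Total latency, assuming single-cycle first and third pipeline stages."""
    return first_stage + len(state_sequence(dp_sp)) + third_stage


for mode, flag in (("DP", True), ("dual-SP", False)):
    seq = state_sequence(flag)
    print(mode, len(seq), "states:", " -> ".join(seq), "| latency", latency_cycles(flag))
```

The states without an entry in LAUNCHED_PRODUCT (S4, S7, S8) drive zeros into the multiplier and perform only additions, subtractions or bit selection.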
C. Third-Stage Architecture
In this stage, in the case of exponent underflow, the mantissa
division quotient is first processed by the dynamic right shifter.
This is followed by dual-mode rounding of the quotient mantissa
(round-to-nearest is implemented), and then by normalization and
exceptional case processing. The architectural details of the
dual-mode dynamic right shifter, which can shift either one DP
mantissa or two parallel SP mantissas, can be found in [2]. It
takes the right-shift amount and the mantissa quotient as primary
inputs. Rounding first computes the unit-at-last-place (ULP)
separately for DP and both SPs, and then performs the ULP addition.
The ULP addition to the quotient mantissa is shared among DP and
both SPs by using two 32-bit incrementers: each acts individually
as an SP ULP adder, and their combination (by propagating the
carry) performs the DP ULP addition. The rounded mantissa quotient
is then normalized separately for DP and both SPs, which requires a
1-bit right shift, and the corresponding exponents are incremented
by one, separately for DP and both SPs. After this, each exponent
and mantissa is updated for exceptional cases (infinity, subnormal
or underflow), which needs separate units for DP and both SPs.
Finally, the computed signs, exponents and mantissas for double
precision and both single precision results are multiplexed using a
64-bit 2:1 MUX to produce the final 64-bit floating point quotient,
which contains either the DP quotient or the two SP quotients.
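The shared ULP addition can be sketched as two chained 32-bit incrementers. The function name is hypothetical, and the position of the mantissas inside the two 32-bit halves is an assumption; only the carry behaviour is modelled.

```python
MASK32 = (1 << 32) - 1


def dual_mode_increment(word: int, dp_sp: bool, ulp_lo: int, ulp_hi: int = 0) -> int:
    """Add rounding ULP bits using two 32-bit incrementers.

    dp_sp=True : the carry out of the low incrementer feeds the high one,
                 so the pair acts as one 64-bit incrementer driven by ulp_lo.
    dp_sp=False: each half is incremented independently by its own ULP bit
                 and the carry between the halves is blocked.
    """
    lo = (word & MASK32) + ulp_lo
    carry = lo >> 32
    hi = (word >> 32) + (carry if dp_sp else ulp_hi)
    return ((hi & MASK32) << 32) | (lo & MASK32)


print(hex(dual_mode_increment(0x1FFFFFFFF, True, 1)))
print(hex(dual_mode_increment(0x1FFFFFFFF, False, 1, 0)))
```

The only mode-dependent element is the selection of the high incrementer's input, which is why the sharing costs little extra logic.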
IV. IMPLEMENTATION RESULTS
The proposed architecture is synthesized with the UMC 90nm standard
cell ASIC library, using Synopsys Design Compiler, with the best
achievable timing constraints. It has a latency of 11 cycles and a
throughput of 10 cycles for DP computation, and a latency of 9
cycles and a throughput of 8 cycles for dual-SP computation.
Functional verification is carried out using 5 million random test
cases for each of the normal-normal, normal-subnormal,
subnormal-normal and subnormal-subnormal operand combinations,
along with other exceptional case verification, for both DP and
dual-SP modes. The design produces a maximum precision loss of
1 ULP (unit at last place), which is sufficient for a large number
of applications.
TABLE I: Comparison of DPdSP Division Architecture

                                  [1]             [2]            Proposed
                                  (Only Normal)   (SubNormal)    (SubNormal)
Gate Count (1)                    212854          163194         66416
Period (FO4) (2)                  31.4            437.5          38.22
Throughput (3) (DP/dSP)           29/15           1/1            10/8
Area × Period × Throughput (4)    193.82×10^6     71.39×10^6     25.38×10^6

(1) Based on minimum size inverter. (2) 1 FO4 (ns) ≈ (Tech. in µm) / 2.
(3) In clock cycles. (4) Gate Count × Period (FO4) × Throughput.
A technology-independent comparison is presented in Table I, in
terms of gate count for area, FO4 delay for timing, cycle counts
for latency and throughput, and a unified metric,
Area × Period (FO4) × Throughput (in clock cycles), which should be
smaller for a better design. Isseven et al. [1] presented an
iterative DPdSP division architecture using the Radix-4 SRT
division algorithm, without sub-normal support. Compared to the
proposed architecture, the architecture of Isseven et al. requires
a much larger area and has a poorer Area × Period × Throughput
metric. The prior work in [2] also requires a significantly larger
area and has a poorer Area × Period × Throughput metric; moreover,
its single-cycle implementation makes that design impractical.
Thus, the proposed architecture is better in terms of these design
metrics. To the best of the authors' knowledge, the literature does
not contain any other dual-mode division architecture that supports
DP with two parallel SP divisions.
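The unified metric in Table I can be recomputed from the other rows. The sketch below uses the gate counts, FO4 periods and DP throughput values from the table. The use of the DP throughput in the product is an inference that reproduces the tabulated values; the dictionary layout and function names are hypothetical.

```python
designs = {
    "[1]":      {"gates": 212854, "period_fo4": 31.4,  "dp_cycles": 29},
    "[2]":      {"gates": 163194, "period_fo4": 437.5, "dp_cycles": 1},
    "Proposed": {"gates": 66416,  "period_fo4": 38.22, "dp_cycles": 10},
}


def unified_metric(d) -> float:
    """Gate Count x Period (FO4) x Throughput (clock cycles), footnote (4)."""
    return d["gates"] * d["period_fo4"] * d["dp_cycles"]


def fo4_ns(tech_um: float) -> float:
    """Footnote (2): 1 FO4 (ns) is approximately (technology in um) / 2."""
    return tech_um / 2


for name, d in designs.items():
    print(f"{name:9s} {unified_metric(d) / 1e6:8.2f} x 10^6")

ratio_1 = unified_metric(designs["[1]"]) / unified_metric(designs["Proposed"])
ratio_2 = unified_metric(designs["[2]"]) / unified_metric(designs["Proposed"])
period_ns = designs["Proposed"]["period_fo4"] * fo4_ns(0.09)
print(round(ratio_1, 2), round(ratio_2, 2), round(period_ns, 2))
```

By this metric the proposed design is roughly 7.6 times better than [1] and 2.8 times better than [2] in DP mode, and its 38.22 FO4 period corresponds to about 1.72 ns at 90nm under the footnote's approximation.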
V. CONCLUSIONS
This paper has presented a dual-mode iterative architecture for DP
FP division. It can process either one DP or two parallel SP
floating point divisions. All components are designed for efficient
dual-mode processing, and a novel dual-mode Radix-4 Modified Booth
multiplier architecture is proposed with minimal overhead. The
proposed dual-mode architecture outperforms prior art in terms of
various design metrics.
VI. ACKNOWLEDGMENTS
This work is partly supported by “The University of Hong Kong”
grant (Project Code 201409176200), the “Research Grants Council” of
Hong Kong (Project ECS 720012E), and the “Croucher Innovation
Award” 2013.
REFERENCES
[1] A. Isseven and A. Akkaş, “A dual-mode quadruple precision
floating-point divider,” in Signals, Systems and Computers, 2006.
ACSSC ’06. Fortieth Asilomar Conference on, 2006, pp. 1697–1701.
[2] M. Jaiswal, R. Cheung, M. Balakrishnan, and K. Paul,
“Configurable architecture for double/two-parallel single precision
floating point division,” in VLSI (ISVLSI), 2014 IEEE Computer
Society Annual Symposium on, July 2014, pp. 332–337.
[3] S. F. Obermann and M. J. Flynn, “Division algorithms and
implementations,” Computers, IEEE Transactions on, vol. 46, no. 8,
pp. 833–854, Aug. 1997.
[4] M. K. Jaiswal, R. Cheung, M. Balakrishnan, and K. Paul, “Series
expansion based efficient architectures for double precision
floating point division,” Circuits, Systems, and Signal Processing,
vol. 33, no. 11, pp. 3499–3526, 2014. [Online]. Available:
http://dx.doi.org/10.1007/s00034-014-9811-8
