Architecture for quadruple precision floating point division with multi-precision support by So, HKH & Jaiswal, MK
Title Architecture for quadruple precision floating point division with multi-precision support
Author(s) Jaiswal, MK; So, HKH
Citation
Proceedings of 2016 IEEE 27th International Conference on
Application-specific Systems, Architectures and Processors
(ASAP), London, UK., 6-8 July 2016, p. 239-240
Issued Date 2016
URL http://hdl.handle.net/10722/229799
Rights
International Conference on Application-Specific Systems,
Architecture and Processors (ASAP) Proceedings. Copyright ©
IEEE, Computer Society.; ©2016 IEEE. Personal use of this
material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale
or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works.; This work is licensed
under a Creative Commons Attribution-NonCommercial-
NoDerivatives 4.0 International License
Architecture for Quadruple Precision Floating Point Division with Multi-Precision
Support
Manish Kumar Jaiswal (1) and Hayden K.-H. So (2)
Dept. of EEE, The University of Hong Kong, Hong Kong
Email: (1) manishkj@eee.hku.hk, (2) hso@eee.hku.hk
Abstract-This paper proposes an FPGA-based hardware
architecture for quadruple precision (QP) division that can
also process single, double and double-extended precision
(SP, DP, DPE) computations. The mantissa division employs
a series expansion methodology, integrated with a wide
integer multiplier that is further optimized for FPGA
implementation by making efficient use of the built-in DSP
blocks. The proposed division architecture is demonstrated
using a Xilinx FPGA-based implementation, which shows a
significant area saving and a much improved latency along
with improved speed.
Keywords-Quadruple Precision Arithmetic, Division, FPGA,
Multi-Precision Division.
I. INTRODUCTION
A large number of important applications demand higher
precision computation [1], which can be supported by
quadruple precision (QP) arithmetic with its roughly
34 decimal digits of precision. To effectively accelerate
this class of high-precision computations, efficient support
in hardware accelerators is needed. In this view, this paper
proposes a multi-precision division architecture that is
capable of performing up to QP operations in hardware. The
main contribution of the present work can be summarized as
follows:
• A multi-precision quadruple precision floating point
division architecture is proposed, which also supports
SP, DP and DPE computations. It is based on the series
expansion methodology of division.
II. PROPOSED MULTI-PRECISION DIVISION
ARCHITECTURE
For multi-precision processing, the input/output operands
of the unified floating point formats are assumed to be laid
out as shown in Fig. 1. The proposed multi-precision
division architecture works in four modes, one each for SP,
DP, DPE and QP processing. It consists of three stages:
pre-processing, core computation and post-processing.
The first stage of the architecture includes data extraction,
sub-normal handling and exception checks, which are
implemented using typical methods. Since the binary point
position (in the input operands) is the same for all modes
(as shown in Fig. 1), the same unified signals for sign,
exponent and mantissa work for all modes. This stage also
includes a part of the mantissa division unit, as discussed
later.
Field      SP         DP         DPE        QP
Sign       [127]      [127]      [127]      [127]
Exponent   [119:112]  [122:112]  [126:112]  [126:112]
Mantissa   [111:89]   [111:60]   [111:48]   [111:0]
Figure 1: Input/Output Register Format
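As a rough illustration of the register format in Fig. 1, the following Python sketch unpacks a 128-bit word into sign, exponent and mantissa fields for each mode. The names FIELDS and unpack are hypothetical helpers introduced here for illustration; only the bit ranges are taken from the figure.

```python
# Bit ranges from Fig. 1: (exponent MSB, mantissa LSB) per mode.
# The exponent always ends at bit 112 and the mantissa always starts at bit 111.
FIELDS = {
    "SP": (119, 89),
    "DP": (122, 60),
    "DPE": (126, 48),
    "QP": (126, 0),
}


def unpack(reg: int, mode: str):
    """Return (sign, exponent, mantissa) of a 128-bit register word."""
    exp_msb, mant_lsb = FIELDS[mode]
    exp_width = exp_msb - 112 + 1
    mant_width = 111 - mant_lsb + 1
    sign = (reg >> 127) & 1
    exponent = (reg >> 112) & ((1 << exp_width) - 1)
    mantissa = (reg >> mant_lsb) & ((1 << mant_width) - 1)
    return sign, exponent, mantissa
```

Because the boundary between exponent and mantissa sits at the same bit position in every mode, the same shift constants (127, 112) serve all four formats; only the field widths change.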
A. The Core Division Processing Architecture
Stage 2 performs the core operations, which compute the
sign, exponent, right-shift amount and mantissa quotient.
The sign, exponent and right-shift amount are processed in
a trivial way, using a unified "BIAS" signal for the
multi-precision environment
(BIAS[14:0] = {{4{QP|DPE}}, {3{QP|DPE|DP}}, 7'h7F}).
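The unified BIAS expression can be checked with a small Python sketch. The function name unified_bias is a hypothetical helper; the concatenation mirrors the expression in the text, with the low seven bits taken as 0x7F, and the result is compared against the standard IEEE 754 exponent biases.

```python
def unified_bias(mode: str) -> int:
    """Build BIAS = {{4{QP|DPE}}, {3{QP|DPE|DP}}, 0x7F} for one mode."""
    wide = mode in ("QP", "DPE")          # QP|DPE
    medium = mode in ("QP", "DPE", "DP")  # QP|DPE|DP
    top4 = 0b1111 if wide else 0          # replicated 4 times
    mid3 = 0b111 if medium else 0         # replicated 3 times
    return (top4 << 10) | (mid3 << 7) | 0x7F


for mode in ("SP", "DP", "DPE", "QP"):
    print(mode, unified_bias(mode))
```

The replicated mode bits simply extend the run of ones, so a single signal yields the bias for 8-bit, 11-bit and 15-bit exponents.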
1) Mantissa Division Unit: The methodology for this unit
is based on [2]. Let $m_1$ be the normalized dividend
mantissa and $m_2$ the normalized divisor mantissa; then
$q$, the mantissa quotient, can be computed as in (1). Here,
$m_2$ is partitioned into two parts, $a_1$ ($W$ bits) and
$a_2$ (all remaining bits), as in (2). Equation (1) can be
solved using only multipliers, adders and subtractors,
provided that the value of $a_1^{-1}$ is available, which
can be read from a pre-stored look-up table. For a bit width
of $W = 8$ for $a_1$, it requires 17 terms (up to
$a_1^{-17} a_2^{16}$) for QP, 9 terms (up to
$a_1^{-9} a_2^{8}$) for DPE, 7 terms (up to
$a_1^{-7} a_2^{6}$) for DP, and 3 terms (up to
$a_1^{-3} a_2^{2}$) for the SP precision requirement. For a
multi-precision architectural implementation, a unified
expression is structured in (3), which supports all the
required precision computations. The size of the look-up
table (LUT) that stores $a_1^{-1}$ is taken as
$2^8 \times 113$, and a full multiplier of size 114x114 bits
is used iteratively under an FSM (Finite State Machine) to
implement (3).
$$ q = \frac{m_1}{m_2} = \frac{m_1}{a_1 + a_2} = m_1 (a_1 + a_2)^{-1} = m_1 \left( a_1^{-1} - a_1^{-2} a_2 + a_1^{-3} a_2^{2} - a_1^{-4} a_2^{3} + \cdots \right) \tag{1} $$
$$ m_2 = \overbrace{1.xxxxxxxx}^{a_1}\ \overbrace{xxxxxxx \ldots xxxxxxx}^{a_2:\ \text{QP: 104-bit, DPE: 56-bit, DP: 44-bit, SP: 15-bit}} \tag{2} $$
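$$ q = \underbrace{\underbrace{\overbrace{\overbrace{m_1 a_1^{-1} - m_1 a_1^{-1}\,(a_1^{-1} a_2 - a_1^{-2} a_2^{2})}^{\text{SP}}\,(1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4}}^{\text{DP}} + a_1^{-6} a_2^{6})}_{\text{DPE}}\,(1 + a_1^{-8} a_2^{8})}_{\text{QP}} \tag{3} $$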
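$$ \begin{aligned} A &= m_1 a_1^{-1}, \quad B = a_1^{-1} a_2, \quad C = a_1^{-2} a_2^{2}, \quad D = a_1^{-4} a_2^{4}, \quad E = a_1^{-6} a_2^{6}, \\ F &= a_1^{-8} a_2^{8}, \quad G = B - C, \quad H_T = 1 + C + D, \quad H = H_T + E, \quad I = 1 + F, \\ J &= G H, \quad K = J I, \quad L = A K, \quad M = A - L \end{aligned} \tag{4} $$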
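The relation between the truncated series (1) and the factored form (3) can be checked numerically. The sketch below is an illustration only: it uses exact rational arithmetic instead of the fixed-point datapath, computes $a_1^{-1}$ exactly instead of reading a 113-bit LUT entry, and the names split_divisor, quotient and truncated_series are hypothetical helpers introduced here.

```python
from fractions import Fraction

W = 8  # fractional bits kept in a1, as in the text
TERMS = {"SP": 3, "DP": 7, "DPE": 9, "QP": 17}  # series terms per mode


def split_divisor(m2: Fraction):
    """Split m2 in [1, 2) into a1 (leading W fractional bits) and a2 (rest)."""
    a1 = Fraction(int(m2 * 2**W), 2**W)
    return a1, m2 - a1


def quotient(m1: Fraction, m2: Fraction, mode: str) -> Fraction:
    """Evaluate the factored expression (3) using the terms named in (4)."""
    a1, a2 = split_divisor(m2)
    inv = 1 / a1          # assumption: exact reciprocal, not a LUT entry
    A = m1 * inv
    B = inv * a2
    C = B * B
    G = B - C
    if mode == "SP":
        K = G
    else:
        D = C * C
        H = 1 + C + D                 # H_T
        if mode in ("DPE", "QP"):
            H += C * D                # H = H_T + E
        K = G * H                     # J
        if mode == "QP":
            K *= 1 + D * D            # K = J * I, with I = 1 + F
    return A - A * K                  # M = A - L


def truncated_series(m1: Fraction, m2: Fraction, terms: int) -> Fraction:
    """Evaluate the first `terms` terms of the alternating series (1)."""
    a1, a2 = split_divisor(m2)
    x = a2 / a1
    return m1 / a1 * sum((-x) ** k for k in range(terms))
```

With exact arithmetic the two functions agree term for term in every mode. Since $a_2 < 2^{-8}$ and $a_1 \geq 1$, the truncation error after $n$ terms stays below $2^{1-8n}$ for mantissas in $[1, 2)$.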
[Block diagram: a 256x113 LUT addressed by M2[111:104] (output $a_1^{-1}$), a 114x114-bit multiplier with inputs in1 and in2, and state registers, split across a first and a second stage. FSM diagram: states S0 to S10 producing the terms A to M of (4), with mode-dependent transitions (mode=00 SP, mode=01 DP, mode=01/10 DP/DPE) and a Done flag.]
Figure 2: Mantissa Division Architecture and FSM
The implementation of (3), as shown in Fig. 2, incorporates
a LUT, a 114x114 multiplier and an FSM. Based on the mode
of operation, the FSM decides the effective inputs of the
multiplier in each state and assigns its output to the
designated terms in (4). A single-stage 114x114 multiplier
is designed around DSP48E IPs, using a combination of the
3-partition and 2-partition Karatsuba methods [3]. Initially,
the multiplier operands are partitioned into 3 sets of
38 bits, which requires three 38x38 and three 39x39
multipliers. The 39x39 multiplier (also used for 38x38) is
designed using the 2-partition method, which needs one
19x19, one 20x20 and one 21x21 multiplier. Each of the
19x19, 20x20 and 21x21 multipliers is implemented using one
DSP48E and some logic resources. The 114x114 multiplier
thus requires only 18 DSP48E blocks.
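The partitioning described above can be sketched in Python to confirm the count of 18 small multiplications. This is an illustrative model under assumptions: the exact split points (20-bit low half, 19-bit high half) and the helper names small_mult, karatsuba2_39 and karatsuba3_114 are chosen here and are not taken from the paper, which only states the multiplier sizes.

```python
mult_count = {}  # number of small multiplications used, keyed by bit width


def small_mult(x: int, y: int, width: int) -> int:
    """Model of one width x width multiplier (one DSP48E plus logic)."""
    assert x < (1 << width) and y < (1 << width)
    mult_count[width] = mult_count.get(width, 0) + 1
    return x * y


def karatsuba2_39(x: int, y: int) -> int:
    """39x39 product via 2-partition Karatsuba (assumed 20/19-bit split)."""
    mask = (1 << 20) - 1
    x0, x1 = x & mask, x >> 20
    y0, y1 = y & mask, y >> 20
    lo = small_mult(x0, y0, 20)
    hi = small_mult(x1, y1, 19)
    mid = small_mult(x0 + x1, y0 + y1, 21) - lo - hi
    return (hi << 40) + (mid << 20) + lo


def karatsuba3_114(x: int, y: int) -> int:
    """114x114 product via 3-partition Karatsuba on 38-bit limbs."""
    mask = (1 << 38) - 1
    x0, x1, x2 = x & mask, (x >> 38) & mask, x >> 76
    y0, y1, y2 = y & mask, (y >> 38) & mask, y >> 76
    # Three 38x38 products (computed on the 39x39 unit).
    p0 = karatsuba2_39(x0, y0)
    p1 = karatsuba2_39(x1, y1)
    p2 = karatsuba2_39(x2, y2)
    # Three 39x39 products of limb sums give the cross terms.
    c01 = karatsuba2_39(x0 + x1, y0 + y1) - p0 - p1
    c02 = karatsuba2_39(x0 + x2, y0 + y2) - p0 - p2
    c12 = karatsuba2_39(x1 + x2, y1 + y2) - p1 - p2
    return p0 + (c01 << 38) + ((c02 + p1) << 76) + (c12 << 114) + (p2 << 152)
```

Each of the six 38/39-bit products costs three small multiplications, which gives the 18 blocks quoted in the text, compared with the nine limb products a schoolbook 3-way split would need before any further decomposition.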
S0 : in1 = {1'b0, m1}, in2 = {1'b0, m2_a1_i}
S1 : in1 = {10'b0, m2_a2}, in2 = {1'b0, m2_a1_i}, A[127:0] = mult[225:98]
S2 : in1 = in2 = B = mult[216:103]
S3 : in1 = in2 = {18'b0, mult[227:132]}, C = mult, G = B − {8'b0, C[227:122]}
S4 : in1 = {50'b0, mult[227:132]}, in2 = {50'b0, C[227:164]}, D = mult[191:0],
     HT = {1'b1, 16'b0, C[227:130]} + {33'b0, D[191:110]}
S5 : in1 = in2 = {66'b0, D[191:144]}, E = mult[127:0]
S6 : in1 = G, in2 = DP ? HT : (H ← HT + {49'b0, E[127:62]}),
     F = mult[95:0], I = {1'b1, 64'b0, F[95:47]}
S7 : in1 = J = mult[227:114], in2 = I
S8 : in1 = SP ? G : (J or K ← mult[227:114]), in2 = A[127:14]
S9 : L = (QP|DPE) ? {6'b0, mult[227:106]} : (DP ? {7'b0, mult[227:107]}
     : {8'b0, mult[227:108]}), in1 = in2 = 0
S10 : in1 = in2 = 0
(5)
The FSM consists of 11 states (S0 to S10). In QP mode it
passes through all of the states (S0 to S10), whereas in DPE
mode it skips state S7. DP mode does not require states S5
and S7, and states S4 to S7 are not required in SP mode.
Some mode-specific assignments can also be seen in states
S6, S8 and S9, which use the mode control signals (QP, DPE,
DP and SP). The FSM requires 11, 10, 9 and 7 cycles for
QP-mode, DPE-mode, DP-mode and SP-mode processing,
respectively.
The post-processing stage performs normalization, rounding
(round-to-nearest) and final processing, all of which are
done using trivial methods on the unified mantissa for
multi-precision processing.
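The state skipping of the mantissa division FSM described above can be summarized in a short Python sketch. The table SKIPPED and the function state_sequence are hypothetical names; the skipped states per mode are taken from the text, and one cycle per visited state is assumed.

```python
# States skipped in each mode, as stated in the text.
SKIPPED = {
    "QP": set(),
    "DPE": {7},
    "DP": {5, 7},
    "SP": {4, 5, 6, 7},
}


def state_sequence(mode: str):
    """Return the list of FSM states (S0..S10) visited in the given mode."""
    return [s for s in range(11) if s not in SKIPPED[mode]]


for mode in ("QP", "DPE", "DP", "SP"):
    seq = state_sequence(mode)
    print(mode, len(seq), ["S%d" % s for s in seq])
```

Counting the visited states reproduces the 11, 10, 9 and 7 cycles quoted for the QP, DPE, DP and SP modes.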
Table I: Comparison of Division Architecture
             [4]        Proposed Multi-Precision Arch.
Latency      118 (QP)   9/11/12/13 (SP/DP/DPE/QP)
Throughput   NA (QP)    8/10/11/12 (SP/DP/DPE/QP)
LUTs         26811      7440
FFs          13809      2584
DSP48        -          18
Freq (MHz)   50         89
III. IMPLEMENTATION RESULTS
The proposed multi-precision QP division architecture is
implemented on a Xilinx Virtex-7 FPGA device. Functional
verification of the proposed architecture is carried out
using 5 million random test cases with various combinations
of operands, and it produces faithfully rounded results (a
maximum precision loss of 1 ULP, unit in the last place).
To the best of the authors' knowledge, the literature does
not contain any other multi-precision quadruple precision
division architecture. Diniz et al. [4] is the only
available work that reports results for a single-mode
quadruple precision division architecture implemented on an
FPGA device. A comparison against it is shown in Table I,
which shows that the proposed work provides better latency,
throughput, area and speed metrics while also providing
multi-precision support.
IV. CONCLUSIONS
This paper presented an iterative multi-precision quadruple
precision division architecture for hardware accelerators,
which is based on the series expansion methodology of
mantissa division. Compared to the available literature, the
proposed architecture performs better in terms of area,
speed, latency and throughput.
Acknowledgments: This work is partly supported by
The University of Hong Kong grant (Project Code
201409176200), the Research Grants Council of Hong
Kong (Project ECS 720012E), and the Croucher Innovation
Award 2013.
REFERENCES
[1] D. H. Bailey, R. Barrio, and J. M. Borwein, “High-precision
computation: Mathematical physics and dynamics,” Applied
Mathematics and Computation, vol. 218, no. 20,
pp. 10106–10121, 2012.
[2] M. Jaiswal, R. Cheung, M. Balakrishnan, and K. Paul, “Series
expansion based efficient architectures for double precision
floating point division,” Circuits, Systems, and Signal
Processing, vol. 33, no. 11, pp. 3499–3526, 2014. [Online].
Available: http://dx.doi.org/10.1007/s00034-014-9811-8
[3] A. Karatsuba and Y. Ofman, “Multiplication of Many-Digital
Numbers by Automatic Computers,” in Proceedings of the
USSR Academy of Sciences, vol. 145, 1962, pp. 293–294.
[4] P. Diniz and G. Govindu, “Design of a field-programmable
dual-precision floating-point arithmetic unit,” in Field
Programmable Logic and Applications, 2006. FPL ’06.
International Conference on, Aug 2006, pp. 1–4.
