Run-Time-Reconfigurable Multi-Precision Floating-Point Matrix Multiplier
  Intellectual Property Core on FPGA by S, Arish & Sharma, R. K.
Circuits Syst Signal Process
DOI 10.1007/s00034-016-0335-2
S. Arish1 · R. K. Sharma1
Received: 25 June 2015 / Revised: 1 May 2016 / Accepted: 3 May 2016
© Springer Science+Business Media New York 2016
Abstract In today’s world, high-power computing applications such as image processing,
digital signal processing, graphics, and robotics require enormous computing power.
These applications use matrix operations, especially matrix multiplication.
Multiplication operations require a lot of computational time and are also complex in
design. We can use field-programmable gate arrays (FPGAs) as low-cost hardware
accelerators along with a low-cost general-purpose processor instead of a high-cost
application-specific processor for such applications. In this work, we employ
Strassen’s algorithm for matrix multiplication and a run-time-reconfigurable
floating-point multiplier for matrix element multiplication. The
run-time-reconfigurable floating-point multiplier is implemented with a custom
floating-point format for variable-precision applications. A combination of the
Karatsuba algorithm and the Urdhva Tiryagbhyam algorithm is used to implement the
binary multiplier. This design can adjust its power and delay according to different
accuracy requirements by reconfiguring itself during run time.
Keywords FPGA · Run-time-reconfigurable · Variable-precision · Vedic
mathematics · Karatsuba
1 Introduction
Matrix multiplication is the most significant operation in high-power computing
applications. Multiplication operations require a lot of computational time, and the
architecture of such multiplication units is complex. The complexity of the
conventional matrix multiplication algorithm is $O(N^3)$ [1], where N is the order of
the matrix. For such processing applications, we cannot use a low-cost general-purpose
data processor; an application-specific signal processor is necessary. Also, most
matrix multiplication algorithms are implemented with sequential logic in
microcontrollers or digital signal processors (DSPs) using sequential languages such as
C or C++, or with sequential tools such as MATLAB. This sequential programming can
increase the execution time multi-fold. By using FPGAs as low-cost hardware
accelerators, we can replace the high-cost application-specific processor with a
low-cost general-purpose processor. Also, by using an efficient matrix multiplication
algorithm such as Strassen’s algorithm [6], which has a complexity of $O(N^{2.81})$, we
can further reduce the complexity of the architecture. Since FPGAs are highly parallel
and reconfigurable, we can reconfigure the hardware efficiently according to different
applications, and the execution time can be reduced significantly. Even though
reconfigurability is the main feature of an FPGA, run-time-reconfigurability is an
entirely different matter. Reprogramming the device again and again during use is not
feasible, and this is where run-time-reconfigurability becomes important. If we can
reconfigure the device on the run according to the application requirements, it can be
a very efficient solution. Floating-point units can be used instead of fixed-point
units to increase the flexibility and the range of operation.
✉ S. Arish
arishsu@gmail.com
1 School of VLSI Design and Embedded Systems, National Institute of Technology Kurukshetra,
Kurukshetra, Haryana, India
Author's personal copy
Cite as: Arish, S. and Sharma, R.K., "Run-Time-Reconfigurable Multi-Precision Floating-Point Matrix Multiplier
Intellectual Property Core on FPGA", Circuits, Systems, and Signal Processing, 2017, pp. 998-1026.
doi: 10.1007/s00034-016-0335-2
A lot of effort has been made over the past few decades to improve the performance
of floating-point computations. Floating-point units not only are complex but also
require more area and hence consume more power than fixed-point units. The complexity
of the floating-point unit increases when we design for high accuracy. Even a minute
error in accuracy can have major consequences. These errors are possible in
floating-point units mainly because of the discrete behaviour of the IEEE-754 [11]
floating-point representation, where a fixed number of bits is used to represent
numbers. Due to the high computational requirements of scientific applications such
as computational geometry, climate modelling, and computational physics, it is
necessary to have extreme precision in floating-point calculations, and this precision
may not be provided by the single-precision or double-precision floating-point
formats. These factors further increase the complexity of the unit. But some
applications do not require high precision; even an approximate value is sufficient
for correct operation. For applications which require lower precision, the use of
double-precision or quadruple-precision floating-point units is a luxury. It wastes
area and power and also increases latency. For portable or wearable devices, where the
accuracy requirement varies with the application and power consumption is a crucial
factor, the use of high-precision floating-point multipliers is not a good option. In
such cases, a variable-precision multiplier can be handy, as it can save much power
and time when the application does not require high precision. A power-efficient
design of a floating-point multiplier with different modes of accuracy selection is
proposed in this paper. With different precision modes, the mode which is most
appropriate for the application concerned can be selected. As the accuracy or
precision requirement decreases, the width of the multiplier decreases, and hence so
do the power consumption and latency.
2 Literature Review
There are many pieces of literature available on matrix multiplication on FPGA-based
platforms and also a few on floating-point matrix multiplication. The mathematical
model for a matrix multiplication algorithm based on the Baugh–Wooley algorithm is
described in paper [1]. This model also uses a systolic architecture with parallel
processing elements. But as the matrix size increases, the delay increases
significantly. These architectures are useful for the multiplication of sub-matrices
in the Strassen algorithm. An excellent review of different matrix multiplication
algorithms, including classical matrix multiplication algorithms, the Winograd
algorithm and the Strassen algorithm, is given in paper [6]. The matrix multiplication
described in [5] is for 3D affine transformations, and its main features are the
proposed floating-point (FP) multiplier and adder, where it uses a pipelined FP
multiply–accumulate unit. A codesign approach for matrix multiplication using the
conventional algorithm is described in paper [14]. It uses parallel adders and
multipliers. Since it uses the conventional algorithm, a significant improvement in
delay cannot be guaranteed. A floating-point matrix multiplication algorithm using a
systolic architecture is explained in paper [21]. A parallel implementation of a large
matrix multiplier coprocessor is described in paper [4]. A highly parallel matrix
product coprocessor is explained in [8], and it is also suitable for large matrix
multiplication. But most of these pieces of literature do not really concentrate on
variable precision or reconfigurability. The present paper focuses mainly on
reconfigurability.
Different variable-precision floating-point methods are explained in papers [2,7,15].
A multi-mode floating-point multiplier, which operates efficiently with every
precision format specified by the IEEE 754-2008 standard, is presented in paper [15].
The multiplier proposed in paper [15] is pipelined to execute one quadruple-precision
multiplication in 3 cycles. A model and implementation of a versatile multiplier,
which can perform either double-precision (paired) or single-precision floating-point
multiplications, or 16-bit or 8-bit SIMD integer (vector) multiplications, is presented
in paper [7]. An efficient variable-precision floating-point multiplier design which
requires very little storage area and also performs additions in parallel with the
multiplication is described in paper [2]. All of these studies illustrate very
efficient variable-precision design methods, but run-time-reconfigurability is not the
main feature of these models. Readily available IPs such as DSP blocks are also often
used.
A combination of the Karatsuba algorithm [16] and the Urdhva Tiryagbhyam algorithm
is used to implement the floating-point multiplier in the proposed model. References
[11,17] give an idea of floating-point formats. The present paper follows the IEEE-754
standard as described in [11]. A very good explanation of various floating-point
formats and multiplier algorithms is given in [17]. The implementation of the
Karatsuba algorithm is described in papers [2,16]. A simple implementation of the
Karatsuba algorithm is explained in paper [16]. Different Vedic mathematics methods,
including the Urdhva Tiryagbhyam algorithm, are described in reference [23]. A
reduced-bit multiplication algorithm based on the Urdhva Tiryagbhyam algorithm is
described in [9]. Reference [9] gives the implementation of a 4 × 4 bit multiplier, but
this literature provides a better-optimized hardware architecture of the Urdhva
Tiryagbhyam algorithm. Various efficient multiplication algorithms used in
floating-point multipliers are described in papers [12,13,20]. A high-speed binary
floating-point multiplier based on the Dadda algorithm is presented in paper [13]. A
Wallace tree multiplier using a Booth recoder for fast arithmetic circuits on FPGA is
proposed in [20], and it gives an improved version of the Wallace tree multiplier
architecture. An architecture for a fast 32-bit floating-point multiplier compliant
with the single-precision IEEE 754-2008 standard is proposed in [12], and this design
intends to make the multiplier faster by implementing adders having the least power
delay constant, which helps in reducing the delay caused by the propagation of the
carry.
Papers [3,18,19] deal with binary multiplication based on Vedic mathematics. Paper
[22] describes an efficient implementation of an IEEE 754 single-precision
floating-point multiplier using Vedic mathematics. A VLSI implementation of a
low-power 16 × 16 bit multiplier using a Vedic mathematics method is described in
paper [3]. An excellent hardware implementation of the Urdhva Tiryagbhyam algorithm is
illustrated in that article. An 8 × 8 bit multiplier using the Urdhva Tiryagbhyam
algorithm is presented in paper [19]. It gives a better hardware implementation in
terms of delay. Some features of these papers are adopted for the proposed model.
Also, those papers are compared with the proposed Karatsuba–Urdhva algorithm regarding
delay and area, and it is observed that the proposed model is a better implementation.
3 Design of Proposed Model
The proposed model is a run-time-reconfigurable floating-point matrix multiplier,
which can be used for high-speed and multiple-precision applications. The basic unit of
the model is the processing element (PE), which is a full-fledged 2 × 2 matrix
multiplier. The design is made highly parallel with the PE as the core processing
unit. The model uses the Strassen algorithm as the matrix multiplication algorithm.
The Strassen algorithm divides the operand matrices into sub-matrices of order 2 × 2,
and the PE performs calculations on these sub-matrices. The model is made highly
parallel such that all the sub-matrix multiplication operations for a 4 × 4 matrix are
executed in parallel, and hence, it takes only the execution time of a 2 × 2 matrix
multiplication to multiply a 4 × 4 matrix. This parallel execution cuts the execution
time by a factor of eight or more, because a classical 4 × 4 matrix multiplication has
64 multiplication and 48 addition operations, whereas a 2 × 2 matrix multiplication
takes only eight multiplication and four addition operations. By using the Strassen
algorithm, the number of multiplications is reduced, which further increases the
efficiency. The block diagram of the proposed model is shown in Fig. 1.
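The operation counts quoted above follow from the classical algorithm, in which each of the n × n output elements needs n multiplications and n − 1 additions. A minimal Python sketch reproduces the figures; the function name is illustrative only and is not part of the proposed design.

```python
def classical_op_counts(n):
    """Return (multiplications, additions) for a classical n x n matrix product."""
    mults = n ** 3           # n multiplications for each of the n*n output elements
    adds = n * n * (n - 1)   # n-1 additions for each of the n*n output elements
    return mults, adds


for order in (2, 4):
    mults, adds = classical_op_counts(order)
    print(order, mults, adds)
```

For orders 2 and 4 this gives (8, 4) and (64, 48), i.e. the multiplication count grows by a factor of 8 and the addition count by a factor of 12.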
The proposed model has two levels. The top level is the Strassen algorithm, and
the bottom level is the proposed run-time-reconfigurable multi-precision floating-
point multiplier. Strassen algorithm is used for matrix operations, and run-time-
reconfigurable multi-precision floating-point multiplier is used to multiply matrix
elements inside the processing element. The main blocks of the proposed model are
explained in detail in the following sections.
Fig. 1 Block diagram of the proposed model
3.1 Strassen Algorithm
The Strassen algorithm, introduced by Strassen in 1969, requires fewer multiplications
of matrix elements than the classical matrix multiplication approach [6]. It is
asymptotically faster than the standard matrix multiplication algorithm, but slower
than the Coppersmith–Winograd algorithm and its derived algorithms. Even though these
algorithms are asymptotically faster, it is not feasible to implement them in
practice, because they only provide an advantage for matrices which are so large that
they cannot be processed by modern hardware. The Coppersmith–Winograd algorithm is
therefore mainly used as a building block in other algorithms to prove theoretical
time bounds.
Strassen showed that a 2 × 2 matrix multiplication can be performed with only seven
multiplications and 18 additions or subtractions. Strassen’s algorithm is a
divide-and-conquer algorithm which partitions each of the operand matrices into
sub-matrices of equal size. It divides any n × n matrix into four sub-matrices, each of
dimension $n/2 \times n/2$, where n is the size of the matrix. It is the first
algorithm to break the $n^3$ ‘barrier’.
Let A, B be the two matrices of size 4 × 4. Matrix A is divided into sub-matrices
A0, A1, A2 and A3 of size 2 × 2, and matrix B is divided into sub-matrices B0, B1,
B2 and B3 of size 2 × 2. Let the product matrix be C. Let C0, C1, C2 and C3 be the
sub-matrices of matrix C, which are also of size 2 × 2. C0, C1, C2 and C3 can be
determined from Eq. (1).
$$
\begin{aligned}
C_0 &= A_0 \cdot B_0 + A_1 \cdot B_2 \\
C_1 &= A_0 \cdot B_1 + A_1 \cdot B_3 \\
C_2 &= A_2 \cdot B_0 + A_3 \cdot B_2 \\
C_3 &= A_2 \cdot B_1 + A_3 \cdot B_3
\end{aligned}
\tag{1}
$$
The process is illustrated in Fig. 2.
Fig. 2 Strassen’s algorithm illustration
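The block decomposition of Eq. (1) can be checked numerically. The following Python sketch assumes the sub-matrices are numbered row by row (index 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right), which is the ordering implied by Eq. (1); all helper names are illustrative.

```python
def matmul(X, Y):
    """Classical product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matadd(X, Y):
    """Element-wise sum of two matrices of equal size."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split4(M):
    """Split a 4 x 4 matrix into 2 x 2 blocks M0, M1, M2, M3 (row by row)."""
    def block(r, c):
        return [row[c:c + 2] for row in M[r:r + 2]]
    return block(0, 0), block(0, 2), block(2, 0), block(2, 2)


def block_product(A, B):
    """Compute C = A * B for 4 x 4 matrices through the four relations of Eq. (1)."""
    A0, A1, A2, A3 = split4(A)
    B0, B1, B2, B3 = split4(B)
    C0 = matadd(matmul(A0, B0), matmul(A1, B2))
    C1 = matadd(matmul(A0, B1), matmul(A1, B3))
    C2 = matadd(matmul(A2, B0), matmul(A3, B2))
    C3 = matadd(matmul(A2, B1), matmul(A3, B3))
    top = [left + right for left, right in zip(C0, C1)]
    bottom = [left + right for left, right in zip(C2, C3)]
    return top + bottom
```

Each of the eight 2 × 2 products in `block_product` is independent of the others, which is what allows them to be assigned to parallel processing elements.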
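Let matrix $A_0 = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$
and matrix $B_0 = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$.
Then, the partial product matrix P is obtained as
$$
P = A_0 \cdot B_0, \quad \text{where } P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix},
$$
i.e.
$$
\begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}
=
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\cdot
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
$$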
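The partial products are obtained as shown in operation (2).
$$
\begin{aligned}
S_1 &= (a_{11} + a_{22}) \cdot (b_{11} + b_{22}) \\
S_2 &= (a_{21} + a_{22}) \cdot b_{11} \\
S_3 &= a_{11} \cdot (b_{12} - b_{22}) \\
S_4 &= a_{22} \cdot (b_{21} - b_{11}) \\
S_5 &= (a_{11} + a_{12}) \cdot b_{22} \\
S_6 &= (a_{21} - a_{11}) \cdot (b_{11} + b_{12}) \\
S_7 &= (a_{12} - a_{22}) \cdot (b_{21} + b_{22})
\end{aligned}
\tag{2}
$$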
The product matrix elements are obtained using Eq. (3) as follows.
$$
\begin{aligned}
p_{11} &= S_1 + S_4 - S_5 + S_7 \\
p_{12} &= S_3 + S_5 \\
p_{21} &= S_2 + S_4 \\
p_{22} &= S_1 - S_2 + S_3 + S_6
\end{aligned}
\tag{3}
$$
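As a quick check of operations (2) and (3), the seven partial products and the four recombinations can be written out directly and compared with the classical 2 × 2 product. This is a minimal Python sketch with illustrative function names; the fourth recombination is taken to be the element p22.

```python
def strassen_2x2(A, B):
    """Multiply two 2 x 2 matrices using seven multiplications (operations (2) and (3))."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    s1 = (a11 + a22) * (b11 + b22)
    s2 = (a21 + a22) * b11
    s3 = a11 * (b12 - b22)
    s4 = a22 * (b21 - b11)
    s5 = (a11 + a12) * b22
    s6 = (a21 - a11) * (b11 + b12)
    s7 = (a12 - a22) * (b21 + b22)
    return [[s1 + s4 - s5 + s7, s3 + s5],
            [s2 + s4, s1 - s2 + s3 + s6]]


def classical_2x2(A, B):
    """Reference 2 x 2 product using eight multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
```

The sketch uses exact integer arithmetic; with floating-point operands the two routes can differ in the last bits because the additions are performed in a different order.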
From Eq. (2), it can be noticed that Strassen’s algorithm requires only seven
multiplication operations to compute the result for a second-order matrix, whereas the
conventional algorithm requires eight multiplications. Hence, the number of
multiplications M required for an nth-order matrix according to the Strassen algorithm
is
$$
M(n) = 7\,M\!\left(\frac{n}{2}\right) \ \text{for } n \ge 2, \qquad M(1) = 1 \tag{4}
$$
where $M\!\left(\frac{n}{2}\right)$ is the multiplication function of sub-matrices of
order $n/2 \times n/2$. The base case $M(1) = 1$ gives $M(2) = 7$, in agreement with
Eq. (2).
The complexity of the Strassen algorithm can be written as
$$
T(n) = 7\,T\!\left(\frac{n}{2}\right) + c\,n^2 \quad \text{for } n \ge 2 \tag{5}
$$
Each sub-matrix is of size $n/2 \times n/2$, and there are 7 multiplication operations.
Adding the matrices together takes $cn^2$ steps for some fixed constant c (because a
matrix has $n^2$ entries).
By applying the Master theorem [10], Eq. (5) works out to be
$$
T(n) = \Theta\!\left(7^{\log_2 n}\right) = \Theta\!\left(n^{\log_2 7}\right) = O\!\left(n^{2.81}\right) \tag{6}
$$
The complexity of the Strassen algorithm is $O(n^{2.81})$, which is better than that of
the classical algorithm.
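To make the gap between the two growth rates concrete, the multiplication count of the Strassen recursion can be tabulated against the classical count. The sketch below assumes n is a power of two and takes one scalar multiplication for a 1 × 1 product as the base case; both function names are illustrative.

```python
import math


def strassen_mults(n):
    """Scalar multiplications when the Strassen recursion is applied down to 1 x 1 blocks."""
    return 1 if n == 1 else 7 * strassen_mults(n // 2)


def classical_mults(n):
    """Scalar multiplications of the classical algorithm."""
    return n ** 3


for n in (2, 4, 8, 16):
    print(n, strassen_mults(n), classical_mults(n))

# The exponent in Eq. (6): log2(7) is about 2.807
print(math.log2(7))
```

For n = 4 the recursion needs 49 multiplications instead of 64, and the saving grows with n because $7^{\log_2 n} = n^{\log_2 7}$.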
In the classical multiplication algorithm, it takes 8 multiplication operations to
compute the product of two 2 × 2 matrices. The equations for the product matrix
elements are shown in Eq. (7).
$$
\begin{aligned}
p_{11} &= (a_{11} \cdot b_{11}) + (a_{12} \cdot b_{21}) \\
p_{12} &= (a_{11} \cdot b_{12}) + (a_{12} \cdot b_{22}) \\
p_{21} &= (a_{21} \cdot b_{11}) + (a_{22} \cdot b_{21}) \\
p_{22} &= (a_{21} \cdot b_{12}) + (a_{22} \cdot b_{22})
\end{aligned}
\tag{7}
$$
Hence, according to the classical algorithm, an nth-order matrix has a complexity
$\Theta(n^{\log_2 8}) = O(n^3)$.
There are two options for handling bigger matrices: bottom-up and top-down.
Bottom-up In this method, the two matrices of size n ($n = 2^p$, where p > 1) are
divided into sub-matrices of size $2^q$, q < p. A smaller matrix of size m × m
($m = 2^{p-q}$) is obtained. The multiplication of these two matrices (the external
algorithm) is carried out using the classical algorithm, while the sub-matrices of
size $2^q \times 2^q$ are multiplied using the Strassen algorithm (the internal
algorithm).
Top-down This method reverses the roles used in the bottom-up design. It uses Strassen
as the external algorithm and the classical multiplication as the internal algorithm.
The bottom-up design is not viable on an FPGA because the number of multiplications
needed for the internal algorithm is too high. The top-down alternative, on the other
hand, is suitable. Besides, this method allows a pipelined hardware structure, which
improves the performance.
A variant of this top-down algorithm is shown in operation (8). It allows the
multiplication to start before all the coefficients of the matrix are available. For
this to be possible, the sums and differences of matrix elements must be expressed as
the α and β terms given in operation (9).
$$
\begin{aligned}
S^1_{ij} &= \sum_{k=1}^{m} \alpha^1_{ik} \cdot \beta^1_{kj} \\
S^2_{ij} &= \sum_{k=1}^{m} \alpha^2_{ik} \cdot b_{2k-1,2j-1} \\
S^3_{ij} &= \sum_{k=1}^{m} a_{2i-1,2k-1} \cdot \beta^2_{kj} \\
S^4_{ij} &= \sum_{k=1}^{m} a_{2i,2k} \cdot \beta^3_{kj} \\
S^5_{ij} &= \sum_{k=1}^{m} \alpha^3_{ik} \cdot b_{2k,2j} \\
S^6_{ij} &= \sum_{k=1}^{m} \alpha^4_{ik} \cdot \beta^4_{kj} \\
S^7_{ij} &= \sum_{k=1}^{m} \alpha^5_{ik} \cdot \beta^5_{kj}
\end{aligned}
\tag{8}
$$
$$
\begin{aligned}
\alpha^1_{ik} &= a_{2i-1,2k-1} + a_{2i,2k} \\
\alpha^2_{ik} &= a_{2i,2k-1} + a_{2i,2k} \\
\alpha^3_{ik} &= a_{2i-1,2k-1} + a_{2i-1,2k} \\
\alpha^4_{ik} &= a_{2i,2k-1} - a_{2i-1,2k-1} \\
\alpha^5_{ik} &= a_{2i-1,2k} - a_{2i,2k} \\
\beta^1_{kj} &= b_{2k-1,2j-1} + b_{2k,2j} \\
\beta^2_{kj} &= b_{2k-1,2j} - b_{2k,2j} \\
\beta^3_{kj} &= b_{2k,2j-1} - b_{2k-1,2j-1} \\
\beta^4_{kj} &= b_{2k-1,2j-1} + b_{2k-1,2j} \\
\beta^5_{kj} &= b_{2k,2j-1} + b_{2k,2j}
\end{aligned}
\tag{9}
$$
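The top-down variant can be exercised in software to confirm that the α and β terms reproduce the classical product. The Python sketch below is an illustration under stated assumptions, not the hardware implementation: the first factors of the third and fourth partial products are taken as $a_{2i-1,2k-1}$ and $a_{2i,2k}$, the fifth β term is taken as the sum $b_{2k,2j-1} + b_{2k,2j}$ so that it matches $S_7$ in operation (2), and the result elements are recombined as in Eq. (3). The function names are hypothetical.

```python
def top_down_product(A, B):
    """Multiply two n x n matrices (n even): Strassen outside, classical sums over k inside."""
    n = len(A)
    m = n // 2

    def a(r, c):
        return A[r - 1][c - 1]  # 1-based element access, as in the text

    def b(r, c):
        return B[r - 1][c - 1]

    def alpha(i, k):
        return (a(2*i - 1, 2*k - 1) + a(2*i, 2*k),
                a(2*i, 2*k - 1) + a(2*i, 2*k),
                a(2*i - 1, 2*k - 1) + a(2*i - 1, 2*k),
                a(2*i, 2*k - 1) - a(2*i - 1, 2*k - 1),
                a(2*i - 1, 2*k) - a(2*i, 2*k))

    def beta(k, j):
        return (b(2*k - 1, 2*j - 1) + b(2*k, 2*j),
                b(2*k - 1, 2*j) - b(2*k, 2*j),
                b(2*k, 2*j - 1) - b(2*k - 1, 2*j - 1),
                b(2*k - 1, 2*j - 1) + b(2*k - 1, 2*j),
                b(2*k, 2*j - 1) + b(2*k, 2*j))  # assumed '+', matching S7 of operation (2)

    C = [[0] * n for _ in range(n)]
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            s1 = s2 = s3 = s4 = s5 = s6 = s7 = 0
            for k in range(1, m + 1):
                al, be = alpha(i, k), beta(k, j)
                s1 += al[0] * be[0]
                s2 += al[1] * b(2*k - 1, 2*j - 1)
                s3 += a(2*i - 1, 2*k - 1) * be[1]
                s4 += a(2*i, 2*k) * be[2]
                s5 += al[2] * b(2*k, 2*j)
                s6 += al[3] * be[3]
                s7 += al[4] * be[4]
            # Recombination as in Eq. (3), written with 0-based list indices
            C[2*i - 2][2*j - 2] = s1 + s4 - s5 + s7
            C[2*i - 2][2*j - 1] = s3 + s5
            C[2*i - 1][2*j - 2] = s2 + s4
            C[2*i - 1][2*j - 1] = s1 - s2 + s3 + s6
    return C


def classical_product(A, B):
    """Reference product used only for checking."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Each term of the sums over k depends only on a few elements of A and B, so accumulation can begin as soon as those elements arrive. The sketch recomputes α and β for every (i, j) pair for brevity; a hardware unit would compute them once.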
3.2 Processing Element (PE)
The proposed model uses multiple processing elements (PEs) in parallel, where each
processing element is a full-fledged 2 × 2 matrix multiplier. Each PE has two levels.
The top level is the matrix multiplication algorithm, and the bottom level is the
run-time-reconfigurable multi-precision floating-point multiplier. The matrix
multiplication algorithm used is the Strassen algorithm, as in the top level of the
proposed model. The run-time-reconfigurable multi-precision floating-point multiplier
at the bottom level multiplies the individual matrix elements. The block diagram of
the PE is shown in Fig. 3. Matrix A and matrix B are the input matrices, and matrix C
is the output product matrix. Various blocks in the PE are explained in the following
sections.
3.2.1 Input and Output Registers
The registers are used to store the input and output matrix elements. The PE has 12
registers of 64 bits: eight registers are used to store the inputs and four registers
are used to store the outputs of the PE. It also has a 3-bit input register to store
the precision select input.
Fig. 3 Block diagram of processing element
3.2.2 Control Logic
The control logic is used to synchronize and control the operations. Two control
signals, ready and reset, are used to control the execution of processes by the processing
element. The control logic is effective mostly in the matrix element multiplication, i.e.
in the run-time-reconfigurable multi-precision floating-point multiplier unit.
3.2.3 Alpha and Beta Calculation
The alpha and beta calculation unit is part of the implementation of the Strassen
algorithm. The alpha ($\alpha^1_{ik}$ to $\alpha^5_{ik}$) and beta ($\beta^1_{kj}$ to
$\beta^5_{kj}$) values are calculated from the matrix elements, and these values are
given to the partial product calculation unit to calculate the partial products.
3.2.4 Partial Product Calculation
This is also a unit corresponding to the Strassen algorithm implementation. The
corresponding alpha and beta values from the alpha and beta calculation unit are used
to calculate the partial products $S^1_{ij}$ to $S^7_{ij}$. These partial products are
calculated using the run-time-reconfigurable multi-precision floating-point multiplier
according to the precision select input given to the PE. The partial product values
are then provided to the product calculation unit.
Fig. 4 Block diagram of run-time-reconfigurable variable-precision floating-point multiplier
3.2.5 Product Calculation
The partial product values obtained from the previous block are used to calculate the
complete result of the PE. The final result is calculated by the addition/subtraction
of partial products according to the Strassen algorithm.
3.3 Run-Time-Reconfigurable Multi-Precision Floating-Point Multiplier
The proposed run-time-reconfigurable multi-precision floating-point multiplier
performs the multiplication of matrix elements in the processing element.
Multiplication of matrix elements in double-precision floating-point format is carried
out according to the different precision requirements of the output. Since it performs
variable-precision multiplication, it plays a significant role in the proposed model.
The basic block of the run-time-reconfigurable multi-precision floating-point
multiplier is a double-precision floating-point unit. According to the precision
required at the output, the size of the mantissa is varied in the floating-point
multiplication operation. The block diagram of the model is shown in Fig. 4.
The run-time-reconfigurable multi-precision floating-point multiplier accepts two
67-bit operands and the ready and reset signals as inputs. The 67-bit input is in a
modified floating-point format where the first 3 bits are the mode/precision select
bits and the remaining 64 bits are the operand to be multiplied, in IEEE
double-precision floating-point format. The precision select bits determine the
precision and mantissa size of the floating-point multiplier to be chosen. According
to the precision selected, the double-precision format input is truncated to a
floating-point format with a custom mantissa size.
Fig. 5 Modified floating-point format used in the model
Once the precision mode is selected, the input mantissa is truncated and rounded
using the special rounding scheme explained in Sect. 3.3.4. This rounding scheme
ensures the least variation in results. The truncated input is given to the
floating-point multiplier, which contains multiple binary multipliers of different word
lengths. The binary multiplier unit matching the required precision is selected, and
the rest of the multipliers are disabled and shut down. Only the required multiplier
is ON, which minimizes power consumption. Once the multiplication is done, the result
is made available at the output in double-precision floating-point format.
3.3.1 Modified Floating Point Format
For the purpose of run-time reconfiguration, the standard IEEE floating-point format
is altered to fit the proposed model. The modified format used is shown in Fig. 5.
The multiplier accepts two inputs of 67 bits. The first 3 bits (66th bit to 64th bit)
are the mode select bits, which select the appropriate precision mode for the
application. The value of the mode select bits must be the same for both inputs.
Otherwise, a mode select error signal is generated and the execution is halted. The
remaining fields are the same as in the IEEE-754 double-precision floating-point
format. The 63rd bit is the sign bit, the next 11 bits from the 62nd bit to the 52nd
bit are the exponent bits, and the last 52 bits are the mantissa bits.
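The bit layout described above can be illustrated in software. The following Python sketch packs a mode value and an IEEE double into a 67-bit integer and extracts the fields again; the function names are hypothetical, and an arbitrary-precision Python integer stands in for the 67-bit hardware word.

```python
import struct


def double_to_bits(x):
    """Return the 64-bit IEEE-754 pattern of a Python float as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def pack_operand(mode_bits, x):
    """Build the 67-bit word: mode select in bits 66..64, IEEE double in bits 63..0."""
    if not 0 <= mode_bits < 8:
        raise ValueError("mode select field is 3 bits wide")
    return (mode_bits << 64) | double_to_bits(x)


def unpack_operand(word):
    """Split a 67-bit word into (mode, sign, exponent, mantissa)."""
    mode = (word >> 64) & 0x7
    sign = (word >> 63) & 0x1
    exponent = (word >> 52) & 0x7FF          # 11 exponent bits, 62..52
    mantissa = word & ((1 << 52) - 1)        # 52 mantissa bits, 51..0
    return mode, sign, exponent, mantissa


def mode_select_error(word_a, word_b):
    """True when the two operands carry different mode select bits."""
    return (word_a >> 64) != (word_b >> 64)


word = pack_operand(0b010, -1.5)
print(unpack_operand(word))  # (2, 1, 1023, 2251799813685248)
```

For -1.5 the sign bit is 1, the biased exponent is 1023 and only the top mantissa bit is set, which is the value `1 << 51` printed above.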
In the custom precision formats for the different modes, the basic double-precision
format is used with variable mantissa sizes. The custom precision formats for the
different precision modes are shown in Fig. 6. Since mantissa multiplication is the
only complex operation, it is done separately according to the mantissa size, and all
other operations are the same as double-precision floating-point operations. All these
custom precision operations are performed by the device itself, and hence, users
cannot access or change the custom precision formats. Only the end result, which is in
double-precision floating-point format, is output.
Even though the design uses a modified floating-point format, it does not alter the
basic double-precision floating-point format and floating-point operations. Also, the
output is in IEEE double-precision floating-point format. The only change made to the
IEEE format is that the three precision select bits are appended as the most
significant bits. These precision select bits can either be generated by an
application program or be taken as a preset value for a particular application. While
giving the inputs for multiplication, the precision select bits are appended to the
double-precision format by the application program. The application program can be any
software program, for example the automatic resolution/bit-rate selection of internet
video streaming based on the connection speed.
Fig. 6 Custom precision floating-point formats used in the model. a Custom precision
format 8-bit mantissa, b custom precision format 16-bit mantissa, c single-precision
format, d custom precision format 32-bit mantissa
3.3.2 Inputs and Outputs
The run-time-reconfigurable multi-precision floating-point multiplier accepts four
inputs and gives out six output signals. The inputs are the two operands, a ready
signal and a reset signal. The two operands are 67 bits wide. The outputs of the
module are the product, the mode select error signal, and four flags for special
values of the product, namely zero, infinity, not-a-number (NaN) and denormal. The
product is in 64-bit-wide double-precision floating-point format, and all other output
signals are single-bit.
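The four special-value flags can be derived from the exponent and mantissa fields of the 64-bit product using the standard IEEE-754 encodings. The Python sketch below shows one way to do this in software; the function name and the dictionary keys are illustrative and do not correspond to signal names in the design.

```python
import struct


def special_flags(x):
    """Classify a double by its IEEE-754 fields into the four special-value flags."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return {
        "zero": exponent == 0 and mantissa == 0,          # all-zero exponent and mantissa
        "infinity": exponent == 0x7FF and mantissa == 0,  # all-ones exponent, zero mantissa
        "nan": exponent == 0x7FF and mantissa != 0,       # all-ones exponent, non-zero mantissa
        "denormal": exponent == 0 and mantissa != 0,      # zero exponent, non-zero mantissa
    }
```

For an ordinary normalized value such as 1.0, all four flags are false.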
3.3.3 Modes of Operation
The mode select bit combinations for the different modes are shown in Table 1. The
modes of the proposed multi-precision multiplier, which are selected according to the
mode select bits, are the following.
Mode 1: Mode 1 is auto-mode, i.e. the controller itself selects the optimum mode
by analysing the inputs and then starts execution. The optimum mode is selected
by counting the number of zeroes after a leading 1. If the number of zeroes after a
leading 1 is 6 or more, then the bits up to that leading 1 are counted. If the number
of bits up to that leading 1 is < 8, then mode 2 (8-bit mantissa mode) is selected.
If the number of bits up to the leading 1 is < 16, the 16-bit mantissa mode is
selected, and so on. The flow chart of auto-mode selection is shown in Fig. 7.
Table 1 Mode select bits for different modes
Mode                  Mode selection bits   Precision (mantissa)
Mode 1 (auto-mode)    000                   According to input
Mode 2                001                   8-bit
Mode 3                010                   16-bit
Mode 4                011                   23-bit
Mode 5                100                   36-bit
Mode 6                101                   52-bit
Fig. 7 Flow chart of auto-mode selection
Mode 2: This is a custom precision format. It uses a basic double-precision
floating-point multiplier with a mantissa size of 8 bits.
Mode 3: This is a custom precision format. It uses a basic double-precision
floating-point multiplier with a mantissa size of 16 bits.
Mode 4: This is a custom precision format. It uses a basic double-precision
floating-point multiplier with a mantissa size of 23 bits.
Mode 5: This is a custom precision format. It uses a basic double-precision
floating-point multiplier with a mantissa size of 36 bits.
Mode 6: This mode is a full-fledged double-precision floating-point multiplier with
accuracy at its best.
The modes with fewer mantissa bits are faster and consume less power. These modes are
best suited for integer multiplication and for applications where accuracy is not a
big issue. Rounding of bits is done before multiplication for every mode except
mode 6, and this limits the variation in results.
Fig. 8 Rounding bits and mantissa of possible result
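The auto-mode rule is given only in words, so the following Python sketch is one possible reading of it rather than the controller's actual logic. It assumes that the 52-bit mantissa is scanned from the most significant bit, that the first 1 followed by at least six zeroes marks the end of the significant part, and that the narrowest mode whose mantissa width exceeds that length is chosen. The table `MODES` and both function names are hypothetical.

```python
# (mode number, mantissa width) pairs taken from the mode list; mode 1 is auto-mode itself
MODES = [(2, 8), (3, 16), (4, 23), (5, 36), (6, 52)]


def significant_bits(mantissa, width=52, zero_run=6):
    """Bits up to and including the first 1 that is followed by zero_run zeroes.

    Assumed interpretation of the rule in the text; returns the full width when
    no such 1 exists and 0 for an all-zero mantissa.
    """
    bits = format(mantissa, "0{}b".format(width))
    for pos, bit in enumerate(bits):
        if bit == "1" and bits[pos + 1:pos + 1 + zero_run] == "0" * zero_run:
            return pos + 1
    return width if "1" in bits else 0


def auto_mode(mantissa):
    """Pick the narrowest mode whose mantissa width exceeds the significant length."""
    length = significant_bits(mantissa)
    for mode, mantissa_width in MODES:
        if length < mantissa_width:
            return mode
    return 6


print(auto_mode(0b101 << 49))     # three significant bits, so mode 2
print(auto_mode((1 << 52) - 1))   # all ones, so mode 6
```

Under this reading, a mantissa whose significant part is short (for example, a small integer value) is routed to a narrow multiplier, while a dense mantissa falls through to full double precision.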
3.3.4 Truncation and Rounding
Truncation and rounding operations are done before and after multiplication except in
mode 6. In mode 6, truncation and rounding are done only after multiplication. In all
other custom precision modes, the inputs are truncated according to the precision
mode selected and rounded using the special rounding scheme developed. This special
rounding scheme uses four bits instead of the three conventional guard (G), round (R)
and sticky (T) bits. The additional bit is termed the extra bit (E). The mantissa of a
possible result with rounding bits is shown in Fig. 8.
The rounding scheme used is a round-up scheme, i.e. the bit ‘rnd’ is added to the
LSB of the mantissa, ‘L’. The value of ‘L’ is calculated according to the least change
in value determined when truncated. After careful evaluation of the rounding bits,
‘rnd’ is calculated as in Eq. (10).
$$
\mathrm{rnd} = G \mathbin{\&} (R \mid T \mid E) \tag{10}
$$
This value of ‘rnd’ gives the least change in accuracy when rounded.
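A software sketch of Eq. (10) is given below. It assumes that G is the first discarded bit and that R, E and T together cover all remaining discarded bits, with T acting as the OR of everything not held in R or E; under that assumption the term (R | T | E) is simply "any lower discarded bit is set", so the exact order of R, T and E does not matter. The function name and parameters are hypothetical.

```python
def round_mantissa(mantissa, width=52, keep=8):
    """Truncate a width-bit mantissa to keep bits and apply rnd = G & (R | T | E).

    Assumes G is the first discarded bit and R, T, E jointly cover the rest.
    The result may need keep+1 bits if the increment carries out; a real
    design would then renormalise and adjust the exponent.
    """
    drop = width - keep
    if drop == 0:
        return mantissa                      # nothing is discarded (full precision)
    kept = mantissa >> drop
    dropped = mantissa & ((1 << drop) - 1)
    g = (dropped >> (drop - 1)) & 1          # guard bit: first discarded bit
    rest = dropped & ((1 << (drop - 1)) - 1) # R, E and the bits folded into T
    rnd = g & (1 if rest else 0)
    return kept + rnd


# 16-bit example, keeping 8 bits
print(bin(round_mantissa(0b1011000011000000, width=16, keep=8)))  # 0b10110001
print(bin(round_mantissa(0b1011000010000000, width=16, keep=8)))  # 0b10110000
```

With this reading the mantissa is incremented only when the discarded part is strictly greater than half of the last kept bit; an exact half is truncated.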
3.3.5 Floating-Point Multiplier
A floating-point number is represented in the IEEE-754 format [11] as
$\pm s \times b^{e}$, i.e. $\pm\,\text{significand} \times \text{base}^{\text{exponent}}$.
To multiply two floating-point numbers $\pm s_1 \times b^{e_1}$ and
$\pm s_2 \times b^{e_2}$, the significands (mantissas) of the numbers are multiplied to
get the product mantissa, and the exponents are added to get the product exponent,
i.e. the product is $\pm (s_1 \times s_2) \times b^{(e_1 + e_2)}$. The various
conditions of special values and exceptions must be considered while multiplying two
floating-point numbers. The hardware block diagram of the floating-point multiplier is
shown in Fig. 9.
In the proposed model, multiple precision is required, and hence the floating-point
multiplier used is different from the conventional floating-point multiplier; it is
shown in Fig. 10. It uses multipliers of different precisions according to the mode
selected. The unused units are shut down to save power. The inputs A and B shown in
Fig. 10 are in double-precision floating-point format and are truncated according to
the mode selected before being given to the floating-point multiplier.
The important blocks in the implementation of the proposed floating-point multiplier
are described in detail in the following sections.
Fig. 9 Floating-point multiplier block diagram
Fig. 10 Block diagram of floating-point multiplier model in the proposed design
3.3.5.1. Sign Calculation The MSB of a floating-point number represents the sign bit.
The sign of the product is positive if both numbers have the same sign and negative
if the numbers have opposite signs. So, to obtain the sign of the product, a simple
XOR gate can be used as the sign calculator.
3.3.5.2. Addition of Exponents The input exponents are added together to get the product
exponent. Since the exponent is stored with a bias in the floating-point format, the bias
must be subtracted from the sum of the stored exponents to get the product exponent in
biased form. The value of the bias is $127_{10}$ ($01111111_2$) for the single-precision format and
$1023_{10}$ ($01111111111_2$) for the double-precision format. In the proposed custom-precision
format also, a bias of $1023_{10}$ is used. The computational time of the mantissa multiplication
is much greater than that of the exponent addition, so a simple ripple carry adder and ripple
borrow subtracter are adequate for exponent addition.
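The sign and exponent paths described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: operands are given as 64-bit double-precision words, the function name `sign_and_exponent` is hypothetical, and overflow, underflow and special values are not handled.

```python
BIAS = 1023       # exponent bias of the double-precision format
EXP_MASK = 0x7FF  # 11-bit exponent field


def sign_and_exponent(a_bits, b_bits):
    """Return (product sign, biased product exponent before normalization)."""
    sign = ((a_bits >> 63) ^ (b_bits >> 63)) & 1  # XOR of the two sign bits
    exp_a = (a_bits >> 52) & EXP_MASK
    exp_b = (b_bits >> 52) & EXP_MASK
    # Both stored exponents carry the bias, so it is subtracted once.
    return sign, exp_a + exp_b - BIAS


x = 0x4069B130AE804118  # input operand taken from Table 9
print(sign_and_exponent(x, x))
```

For the Table 9 input the stored exponent is 0x406 = 1030, so the sketch returns a biased exponent of 1037; the result exponent 0x40e = 1038 in Table 9 is one higher because of the normalization shift of the mantissa product.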
3.3.5.3. Karatsuba–Urdhva Tiryagbhyam Binary Multiplier for Mantissa Multiplication
In floating-point multiplication, the most important and complex part is the mantissa
multiplication. Multiplication requires more time than addition, and as the number of
bits increases, it consumes more area and time. The double-precision and single-precision
formats require a 53 × 53 bit multiplier and a 24 × 24 bit multiplier, respectively, for
mantissa multiplication. This operation is the major contributor to the delay of the
floating-point multiplier, so the binary multiplier must be designed and implemented
efficiently to reduce the area and delay of the floating-point multiplier. Since the
proposed model uses multiple multipliers, the efficiency of the multiplication algorithm
is especially important.
To make the multiplication operation more area efficient and faster, a combination
of the Karatsuba algorithm [16] and the Urdhva Tiryagbhyam algorithm [23] is used in the
proposed model, combining the features of both algorithms. The problem with the
conventional multiplication algorithm and the Booth algorithm is that area increases
drastically with increase in word length. The Karatsuba and Urdhva Tiryagbhyam
algorithms each have their strengths, but have limitations too. The Karatsuba algorithm
uses a divide-and-conquer approach: it breaks the inputs into a most significant half and
a least significant half, and this process continues until the operands are 8 bits wide.
It is best suited for operands of higher bit length and is less efficient at lower bit
lengths.
The Urdhva Tiryagbhyam algorithm is efficient for binary multiplication in terms of
area and delay at low bit lengths. But the partial products are added in a ripple manner in
this algorithm, and hence as the number of bits increases, the delay also increases. For
example, a 4-bit multiplication requires six adders connected in a ripple manner,
an 8-bit multiplication requires 14 adders, and so on. Compensating for the delay
causes an increase in area. So the Urdhva Tiryagbhyam algorithm is not optimal if the
number of bits is high.
To overcome the limitations of the Karatsuba algorithm, the Urdhva Tiryagbhyam
algorithm can be used at the lower stages, where it is more efficient for multiplication of
lower bit lengths. Using the Karatsuba algorithm at the higher stages and the Urdhva
Tiryagbhyam algorithm at the lower stages partly compensates for the limitations
of both algorithms, and hence the multiplier becomes more efficient. The circuit is
further optimized by using carry select and carry save adders instead of ripple carry
adders. The use of carry save adders reduces the delay to a great extent with minimal
increase in hardware. These two algorithms are explained in detail in the following
sections. The model of the Karatsuba–Urdhva Tiryagbhyam algorithm is shown in Fig. 11.
Fig. 11 Karatsuba–Urdhva multiplier model
The two inputs A and B are two n-bit binary numbers. The inputs are divided into
most significant and least significant halves until the operands are 8 bits wide. If
n is even and each division also yields halves with an even number of bits, the
algorithm uses Eq. (18). If n is odd, or a division produces operands with an odd
number of bits, the algorithm uses Eq. (20) for the further steps. The resulting
8-bit-wide most significant and least significant half operands are multiplied using the
Urdhva Tiryagbhyam algorithm. Finally, after all the sub-multiplication results are
obtained, the multiplications by powers of 2 in Eqs. (18) and (20) are implemented as
shifts, and the shifted results are added to get the final result.
There are efficient multiplication algorithms such as the Booth and modified Booth
algorithms, but their area requirement increases drastically as the number of operand
bits increases. The proposed algorithm is more area efficient than these faster
algorithms when the number of bits is high.
Urdhva Tiryagbhyam Algorithm for Multiplication The Urdhva Tiryagbhyam sutra is an
ancient Vedic mathematics method for multiplication. It is a general formula applicable
to all cases of multiplication. The formula is very short, consists of only one
compound word, and means ‘vertically and crosswise’. The Urdhva Tiryagbhyam
algorithm reduces the number of steps required for multiplication and hence increases
the speed of multiplication. The steps for computing the product of two 4-bit numbers
are illustrated in Eq. (11). The two inputs are $a_3a_2a_1a_0$ and $b_3b_2b_1b_0$, and
the product is $p_7p_6p_5p_4p_3p_2p_1p_0$. The temporary partial products are
$t_0, t_1, t_2, \ldots, t_6$, obtained from the steps illustrated below. The line notation
of the steps is shown in Fig. 12.
Fig. 12 Line notation of Urdhva Tiryagbhyam sutra
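$$
\left.
\begin{aligned}
\text{Step 1: } & t_0\ (\text{1 bit}) = a_0b_0 \\
\text{Step 2: } & t_1\ (\text{2 bit}) = a_1b_0 + a_0b_1 = t_1[1]\,t_1[0] \\
\text{Step 3: } & t_2\ (\text{2 bit}) = a_2b_0 + a_1b_1 + a_0b_2 = t_2[1]\,t_2[0] \\
\text{Step 4: } & t_3\ (\text{3 bit}) = a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3 = t_3[2]\,t_3[1]\,t_3[0] \\
\text{Step 5: } & t_4\ (\text{2 bit}) = a_3b_1 + a_2b_2 + a_1b_3 = t_4[1]\,t_4[0] \\
\text{Step 6: } & t_5\ (\text{2 bit}) = a_3b_2 + a_2b_3 = t_5[1]\,t_5[0] \\
\text{Step 7: } & t_6\ (\text{1 bit}) = a_3b_3
\end{aligned}
\right\}
\tag{11}
$$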
where $t_n[2]$, $t_n[1]$, $t_n[0]$ are the bits of the partial product $t_n$, ordered by binary
positional weight.
The product is obtained by adding $s_1$, $s_2$ and $s_3$ as shown in Eq. (12), where
$s_1$, $s_2$ and $s_3$ are the partial sums obtained, each aligned to the binary weight of its bits.
$$
\left.
\begin{aligned}
s_1 &= t_6\,t_5[0]\,t_4[0]\,t_3[0]\,t_2[0]\,t_1[0]\,t_0 \\
s_2 &= t_5[1]\,t_4[1]\,t_3[1]\,t_2[1]\,t_1[1] \\
s_3 &= t_3[2]
\end{aligned}
\right\}
\tag{12}
$$
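The steps of Eqs. (11) and (12) can be checked with a short Python sketch. It is an illustration only: the function name `urdhva_4bit` is hypothetical, and the alignment of $s_2$ and $s_3$ is an assumption derived from the bit weights (bit $t_k[1]$ has weight $2^{k+1}$ and $t_3[2]$ has weight $2^5$), since Eq. (12) lists only the bit order.

```python
def urdhva_4bit(a, b):
    """Multiply two 4-bit unsigned integers following Eqs. (11) and (12)."""
    a_bit = [(a >> i) & 1 for i in range(4)]
    b_bit = [(b >> i) & 1 for i in range(4)]

    # Eq. (11): t[k] sums the crosswise bit products a_i * b_j with i + j = k.
    t = [
        sum(a_bit[i] & b_bit[k - i] for i in range(4) if 0 <= k - i <= 3)
        for k in range(7)
    ]

    # Eq. (12): collect the bits of each t[k] at their binary weights.
    s1 = sum((t[k] & 1) << k for k in range(7))                  # bits t_k[0]
    s2 = sum(((t[k] >> 1) & 1) << (k + 1) for k in range(1, 6))  # bits t_k[1]
    s3 = ((t[3] >> 2) & 1) << 5                                  # bit t_3[2]
    return s1 + s2 + s3


print(urdhva_4bit(13, 11))  # 143
```

Adding the three aligned partial sums corresponds to the three-operand 7-bit addition that produces the 8-bit product.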
This method can be further optimized to reduce the amount of hardware. A more
optimized hardware architecture [9,19] is shown in Fig. 13. This model eliminates
the need for a three-operand 7-bit adder and hence reduces hardware and
delay. The adders are connected in a ripple manner.
The expressions for product bits are as shown in Eq. (13).
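$$
\left.
\begin{aligned}
p_0 &= a_0b_0 \\
p_1 &= \mathrm{LSB}(\mathrm{Sum}(\mathrm{ADDER1})) = \mathrm{LSB}(a_1b_0 + a_0b_1) \\
p_2 &= \mathrm{LSB}(\mathrm{Sum}(\mathrm{ADDER2})) = \mathrm{LSB}(\mathrm{MSB}(\mathrm{ADDER1}) + a_2b_0 + a_1b_1 + a_0b_2) \\
p_3 &= \mathrm{LSB}(\mathrm{Sum}(\mathrm{ADDER3})) = \mathrm{LSB}(\mathrm{MSB}(\mathrm{ADDER2}) + a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3) \\
p_4 &= \mathrm{LSB}(\mathrm{Sum}(\mathrm{ADDER4})) = \mathrm{LSB}(\mathrm{MSB}(\mathrm{ADDER3}) + a_3b_1 + a_2b_2 + a_1b_3) \\
p_5 &= \mathrm{LSB}(\mathrm{Sum}(\mathrm{ADDER5})) = \mathrm{LSB}(\mathrm{MSB}(\mathrm{ADDER4}) + a_3b_2 + a_2b_3) \\
p_6 &= \mathrm{LSB}(\mathrm{Sum}(\mathrm{ADDER6})) = \mathrm{LSB}(\mathrm{MSB}(\mathrm{ADDER5}) + a_3b_3) \\
p_7 &= \text{Carry of ADDER6}
\end{aligned}
\right\}
\tag{13}
$$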
Since there are more than two operands in adders 2–5, we can use carry save
addition to implement these adders. This technique reduces the delay to a great extent
compared to the ripple carry adder.
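The ripple-connected adder chain of Eq. (13) can be modelled in Python as follows. This is a behavioural sketch, not the hardware description: the name `urdhva_4bit_ripple` is hypothetical, each adder is assumed to receive the carry bits of the adder immediately before it, and MSB(ADDERk) is interpreted as all sum bits above the LSB (which can be more than one bit wide).

```python
def urdhva_4bit_ripple(a, b):
    """Behavioural model of the six-adder ripple structure of Eq. (13)."""
    a_bit = [(a >> i) & 1 for i in range(4)]
    b_bit = [(b >> i) & 1 for i in range(4)]

    product = 0
    carry = 0  # upper sum bits handed from one adder to the next
    for k in range(7):
        # Crosswise bit products that share the binary weight 2**k.
        column = sum(a_bit[i] & b_bit[k - i] for i in range(4) if 0 <= k - i <= 3)
        total = carry + column          # output of the adder for this column
        product |= (total & 1) << k     # its LSB is product bit p_k
        carry = total >> 1              # remaining bits ripple onward
    return product | (carry << 7)       # p_7 is the carry of the last adder


print(urdhva_4bit_ripple(15, 15))  # 225
```

Column 0 needs no adder because it contains the single term $a_0b_0$, which leaves six adders for columns 1 to 6, in line with the adder count quoted for the 4-bit case.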
Karatsuba Algorithm for Multiplication The Karatsuba multiplication algorithm is best
suited for multiplying very large numbers. This method was introduced by Anatoli
Karatsuba in 1962. It is a divide-and-conquer method, in which the numbers are divided
into their most significant and least significant halves before multiplication is
performed. The Karatsuba algorithm reduces the number of multipliers required by
replacing multiplication operations with addition operations. Addition operations are
faster than multiplication operations, and hence the speed of the multiplier is increased.
As the number of input bits increases, the Karatsuba algorithm becomes more efficient.
This algorithm is optimal if the width of the inputs is more than 16 bits. The hardware
architecture of the Karatsuba algorithm is shown in Fig. 14.
The Karatsuba algorithm for two inputs X and Y can be explained as follows:
$$\text{Product} = X \cdot Y$$
X and Y can be written as
$$X = 2^{n/2} \cdot X_l + X_r \tag{14}$$
$$Y = 2^{n/2} \cdot Y_l + Y_r \tag{15}$$
Fig. 13 Hardware architecture of 4 × 4 Urdhva Tiryagbhyam multiplier
Fig. 14 Hardware architecture of Karatsuba algorithm
where $X_l$, $Y_l$ and $X_r$, $Y_r$ are the more significant half and less significant half of X
and Y, respectively, and n is the number of bits.
Then,
$$X \cdot Y = \left(2^{n/2} \cdot X_l + X_r\right) \cdot \left(2^{n/2} \cdot Y_l + Y_r\right) = 2^{n} \cdot X_lY_l + 2^{n/2}\,(X_lY_r + X_rY_l) + X_rY_r \tag{16}$$
The second term in Eq. (16) can be optimized to reduce the number of multiplication
operations, i.e.
$$X_lY_r + X_rY_l = (X_l + X_r)(Y_l + Y_r) - X_lY_l - X_rY_r \tag{17}$$
Equation (16) can be rewritten as
$$X \cdot Y = 2^{n} \cdot X_lY_l + X_rY_r + 2^{n/2}\,\big((X_l + X_r)(Y_l + Y_r) - X_lY_l - X_rY_r\big) \tag{18}$$
The recurrence of the Karatsuba algorithm is as follows:
$$T(n) = 3\,T\!\left(\frac{n}{2}\right) + O(n) \;\Rightarrow\; T(n) = O\!\left(n^{\log_2 3}\right) \approx O\!\left(n^{1.585}\right) \tag{19}$$
Even though these two algorithms are straightforward, some modifications to the
equations are needed to make them easier to implement. The equations of the Karatsuba
algorithm work well for operands with an even number of bits. When the number of
operand bits is odd, the operands cannot be divided into equal more significant and less
significant halves. The following set of equations handles operands with an odd number
of bits while following the same method.
Let the number of bits of the two operands X, Y be n, which is an odd number. Let us
define two integers f and s, where
$$f = \left\lfloor \frac{n}{2} \right\rfloor \ (\text{integer part only}), \qquad s = n - \left\lfloor \frac{n}{2} \right\rfloor$$
The lengths of $X_l$, $Y_l$ and $X_r$, $Y_r$ are f and s, respectively.
Let $X_lY_l = p_1$, $X_rY_r = p_2$, $(X_l + X_r)(Y_l + Y_r) = p_3$. From Eq. (18), the result
of multiplication can be written as
$$X \cdot Y = 2^{2s} \cdot p_1 + p_2 + 2^{s} \cdot (p_3 - p_2 - p_1) \tag{20}$$
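The combined scheme, Karatsuba splitting at the upper levels according to Eq. (20) and a vertically-and-crosswise multiplier once the operands are at most 8 bits wide, can be sketched in Python. The sketch is a software model under assumptions, not the hardware of the proposed design: the function names are hypothetical, the sums $X_l + X_r$ and $Y_l + Y_r$ are treated as (s + 1)-bit operands, and the base multiplier is a generic column-wise model rather than the optimized adder structure of Fig. 13.

```python
def urdhva_multiply(a, b, width):
    """Vertically-and-crosswise product of two width-bit unsigned integers."""
    a_bit = [(a >> i) & 1 for i in range(width)]
    b_bit = [(b >> i) & 1 for i in range(width)]
    product, carry = 0, 0
    for k in range(2 * width - 1):
        lo, hi = max(0, k - width + 1), min(k, width - 1)
        column = sum(a_bit[i] & b_bit[k - i] for i in range(lo, hi + 1))
        total = column + carry
        product |= (total & 1) << k
        carry = total >> 1
    return product | (carry << (2 * width - 1))


def karatsuba_urdhva(x, y, width, base_width=8):
    """Karatsuba recursion per Eq. (20) with an Urdhva base multiplier."""
    if width <= max(base_width, 3):  # widths of 3 or less cannot shrink further
        return urdhva_multiply(x, y, width)
    f = width // 2                   # bits in the more significant part
    s = width - f                    # bits in the less significant part
    mask = (1 << s) - 1
    x_l, x_r = x >> s, x & mask
    y_l, y_r = y >> s, y & mask
    p1 = karatsuba_urdhva(x_l, y_l, f, base_width)
    p2 = karatsuba_urdhva(x_r, y_r, s, base_width)
    p3 = karatsuba_urdhva(x_l + x_r, y_l + y_r, s + 1, base_width)
    # Multiplications by powers of two are realized as shifts.
    return (p1 << (2 * s)) + p2 + ((p3 - p1 - p2) << s)


m = (1 << 52) | 0x9B130AE804118  # 53-bit mantissa with the hidden 1 restored
print(karatsuba_urdhva(m, m, 53) == m * m)  # True
```

For an even width, s equals n/2 and the return expression reduces to Eq. (18), so the same function covers both the even and the odd case.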
Normalization of the Result Floating-point representations have a hidden bit in the
mantissa, which always has the value 1 and is therefore not stored in memory, saving
one bit. The hidden bit is the leading ‘1’ of the mantissa, i.e. the ‘1’ immediately to
the left of the binary point. Normalization is usually done by shifting so that the MSB
of the mantissa becomes nonzero, which in radix-2 representation means ‘1’. The binary
point in the mantissa multiplication result is shifted left if the leading ‘1’ is not
immediately to its left, and for each such shift the exponent value is incremented by
one. This is called normalization of the result. Since the value of the hidden bit is
always 1, it is called the ‘hidden 1’.
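The normalization step can be illustrated with integer mantissas in Python. The sketch assumes that both input mantissas are normalized (hidden 1 set), so their product lies in the range [1, 4) and at most one shift is needed; it simply truncates the discarded low-order bits, and the function name `normalize_product` is hypothetical.

```python
def normalize_product(mant_a, mant_b, exponent, frac_bits=52):
    """Multiply two normalized mantissas and renormalize the result.

    mant_a and mant_b are integers with the hidden 1 at bit position
    frac_bits, i.e. they encode values in [1, 2).
    """
    product = mant_a * mant_b               # value in [1, 4), 2*frac_bits fraction bits
    if product >> (2 * frac_bits + 1):      # value >= 2: leading 1 is one place too high
        exponent += 1                       # moving the point left adds one to the exponent
        mantissa = product >> (frac_bits + 1)
    else:
        mantissa = product >> frac_bits
    return mantissa, exponent               # mantissa again has frac_bits fraction bits


# 1.5 * 1.5 = 2.25 = 1.125 * 2**1, using 4 fraction bits for readability
print(normalize_product(0b11000, 0b11000, 0, frac_bits=4))  # (18, 1)
```

In the example, 18 is 0b10010, i.e. 1.0010 in binary or 1.125, and the exponent has been incremented once.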
Representation of Exceptions Some numbers cannot be represented with a
normalized significand. To represent those numbers, special codes are assigned. The
proposed model uses the double-precision floating-point format, and four output
signals, namely zero, infinity, not-a-number (NaN) and denormal, indicate these
exceptions. If the product has exponent + bias = 0 and significand = 0, the result
is taken as zero (±0). If the product has exponent + bias = 1023 and significand = 0,
the result is taken as infinity (∞). If the product has exponent + bias = 1023 and
significand ≠ 0, the result is taken as NaN. Denormalized values, or denormals,
are numbers without a hidden 1 and with the smallest possible exponent. They
are used to represent certain small numbers that cannot be represented as normalized
numbers. If the product has exponent + bias = 0 and significand ≠ 0, the result
is represented as a denormal. A denormal is represented as $\pm 0.s \times 2^{-511}$, where s is the
significand or mantissa.
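The four exception flags can be illustrated with ordinary Python floats. This sketch assumes the standard IEEE 754 binary64 field encodings, in which an all-zero exponent field marks zeros and denormals and an all-one exponent field (0x7FF) marks infinities and NaNs; the thresholds used inside the proposed design may differ, and the function name `classify` is hypothetical.

```python
import struct


def classify(value):
    """Return the four exception flags for a binary64 value."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    exp_field = (bits >> 52) & 0x7FF       # stored (biased) exponent
    fraction = bits & ((1 << 52) - 1)      # stored significand bits
    return {
        "zero": exp_field == 0 and fraction == 0,
        "infinity": exp_field == 0x7FF and fraction == 0,
        "nan": exp_field == 0x7FF and fraction != 0,
        "denormal": exp_field == 0 and fraction != 0,
    }


for sample in (0.0, float("inf"), float("nan"), 5e-324, 1.5):
    print(sample, classify(sample))
```

A normalized number such as 1.5 raises none of the four flags.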
4 Implementation and Results
This project is programmed in Verilog HDL, and synthesized and simulated using
Xilinx synthesis tools (Xilinx ISE 14.7), targeting the Xilinx Virtex 5 ML10 Evaluation
platform. We use a bottom-up design process to implement this project. The design
is done in three stages. The first stage is the design and testing of the Karatsuba–Urdhva
Tiryagbhyam binary multiplier. Binary multipliers for all the stages are designed and
tested at this stage. The results of the implementation are shown in the following tables.

Table 2 Performance analysis of various-word-length Karatsuba–Urdhva Tiryagbhyam binary multipliers

| | 8-bit multiplier | 16-bit multiplier | 24-bit multiplier | 32-bit multiplier | 53-bit multiplier |
|---|---|---|---|---|---|
| Slices | 135 | 442 | 1259 | 1381 | 4587 |
| LUTs | 89 | 337 | 1013 | 1138 | 3891 |
| IOBs | 33 | 65 | 97 | 129 | 213 |
| Delay (ns) | 5.794 | 7.333 | 8.411 | 9.027 | 10.213 |
| fmax (MHz) | 352.678 | 330.940 | 309.449 | 292.950 | 255.213 |

Table 3 Comparison of 8-bit multipliers with the proposed multiplier

| | Reference [18] | Reference [19] | Reference [22] | Proposed multiplier |
|---|---|---|---|---|
| Width | 8 bits | 8 bits | 8 bits | 8 bits |
| Delay (ns) | 28.27 | 15.050 | 23.973 | 9.396 |

Table 4 Comparison of 16-bit multipliers with the proposed multiplier

| | Reference [20] (Vedic multiplier) | Reference [3] | Proposed multiplier |
|---|---|---|---|
| Width | 16 bits | 16 bits | 16 bits |
| Delay (ns) | 13.452 | 27.148 | 11.514 |

Table 5 Comparison of 24-bit multipliers with the proposed multiplier

| | Slices | LUTs | Delay (ns) |
|---|---|---|---|
| Reference [12] | 1306 | 2329 | 16.316 |
| Proposed multiplier | 972 | 1018 | 12.996 |
Table 2 shows the performance analysis of Karatsuba–Urdhva Tiryagbhyam binary
multipliers of various bit lengths. Tables 3, 4, 5 and 6 compare some existing models
of different word lengths with the proposed multiplier. Note that area is compared
according to the number of LUTs used, and the number of LUTs used varies with the
coding style. Figure 15 shows the percentage change in area and delay with increasing
bit length. The values are calculated by determining the relative change in area and
delay when the size of the multiplier changes from 8 to 16 bits, 16 to 32 bits and so on.
For example, when the size of the multiplier changes from 16 to 32 bits, the area becomes
3.3768 times the area of the 16-bit multiplier. From this graph, it can be seen that the
percentage increase in area is not significant and that the percentage increase in delay
decreases with increasing bit length.
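The relative changes quoted above follow directly from the LUT and delay rows of Table 2. The short Python sketch below recomputes some of them; the dictionary and function names are hypothetical, and only the values listed in Table 2 are used.

```python
# LUT count and delay (ns) per multiplier width, copied from Table 2
luts = {8: 89, 16: 337, 24: 1013, 32: 1138, 53: 3891}
delay_ns = {8: 5.794, 16: 7.333, 24: 8.411, 32: 9.027, 53: 10.213}


def relative_change(table, width_from, width_to):
    """Ratio of a metric at width_to to the same metric at width_from."""
    return table[width_to] / table[width_from]


print(relative_change(luts, 16, 32))      # area ratio for 16 -> 32 bits
print(relative_change(luts, 32, 53))      # area ratio for 32 -> 53 bits
print(relative_change(delay_ns, 8, 16))   # delay ratio for 8 -> 16 bits
print(relative_change(delay_ns, 16, 32))  # delay ratio for 16 -> 32 bits
print(relative_change(delay_ns, 32, 53))  # delay ratio for 32 -> 53 bits
```

The 16-to-32-bit area ratio evaluates to about 3.377, which agrees with the value of 3.3768 quoted in the text.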
Table 6 Comparison of 32-bit multipliers with the proposed multiplier

| | LUTs | Delay (ns) |
|---|---|---|
| Reference [20], modified Booth multiplier (Radix-8) | 2721 | 12.081 |
| Reference [20], modified Booth multiplier (Radix-16) | 7161 | 11.564 |
| Reference [20] | 2704 | 9.536 |
| Proposed multiplier | 1545 | 13.141 |
Fig. 15 Percentage change in area and delay with change in word length of Karatsuba–Urdhva Tiryagbhyam
multiplier (plot of change in area and change in delay versus bit length; data labels: area 2.696, 3.3768, 3.4191; delay 1.2656, 1.231, 1.1314)
The second stage is the design and testing of individual floating-point units with
different precision formats and integration of all the individual units to implement
the run-time-reconfigurable multi-precision floating-point multiplier. The results and
analysis of the implementation are discussed below.
Table 7 shows the performance analysis of the floating-point units having
different precisions. It can be seen that with increasing word length, the increase in
delay is small. Table 8 compares the delay and area of existing single-precision
floating-point multipliers with the single-precision FP multiplier in the proposed
model. Even though the area is comparable to other models, the delay is much better
in the proposed single-precision floating-point unit. Figure 16 shows the percentage
variation in area and delay with increasing precision of the floating-point units. The
values are calculated by determining the relative change in area and delay when the
precision of the floating-point multiplier changes. For example, when the
precision changes from 8 to 16 bit and from 16 to 32 bit, the area changes 2.05 times
and 2.39 times, respectively. It can be seen that the increase in area and delay is
small with the increase in word length.
The results of multiplying two double-precision floating-point numbers in different
modes are shown in Table 9. From the table, it can be seen that the mantissa of the
final result varies in the lower precision modes, but only by around 1/1000, which is
good enough for integer-level precision and for small numbers with small exponents.
The variation in precision across the different modes is shown in Fig. 17. From the
graph, it can be seen that the precision varies slightly in the 8- and 16-bit modes, but
there is little variation in the higher precision modes. So it can be concluded that the
lower precision modes can be used for integer-level precision and the higher precision
modes can be used where greater precision is needed.

Table 7 Performance analysis of various floating-point units in the proposed run-time-reconfigurable
multi-precision floating-point multiplier

| | 8-bit precision floating-point multiplier | 16-bit precision floating-point multiplier | 23-bit precision floating-point multiplier | Double-precision floating-point multiplier |
|---|---|---|---|---|
| Slices | 157 | 497 | 1259 | 4587 |
| LUTs | 220 | 451 | 1078 | 3983 |
| IOBs | 61 | 83 | 104 | 193 |
| Delay (ns) | 7.234 | 10.234 | 11.008 | 12.785 |
| fmax (MHz) | 344.767 | 323.630 | 309.449 | 255.213 |

Table 8 Delay and area comparison of single-precision floating-point multiplier with single-precision FP
multiplier in the proposed model

| | Slices | LUTs | Delay (ns) |
|---|---|---|---|
| Reference [12] | 1269 | 2270 | 18.783 |
| Reference [13] | 1149 | 1146 | – |
| Proposed multiplier | 1259 | 1078 | 11.008 |

Fig. 16 Percentage change in area and delay with change in word length of various floating-point units in
the proposed run-time-reconfigurable multi-precision floating-point multiplier (plot of change in area and change in delay versus bit length; data labels: area 2.05, 2.39, 3.695; delay 1.415, 1.076, 1.161)
Table 9 Analysis of results of double-precision floating-point numbers in different modes

| | Result in double-precision format | Value in decimal | Variation of mantissa in result |
|---|---|---|---|
| Input 1 | 4069b130ae804118 | 1.605759317 × 2^7 | – |
| Input 2 | 4069b130ae804118 | 1.605759317 × 2^7 | – |
| Auto-mode | 40e4a0b1337cdfbd | 1.289231492 × 2^15 | 0.0 |
| 8-bit precision | 40e49ec800000000 | 1.288978577 × 2^15 | 0.000252915 |
| 16-bit precision | 40e4a0b01b480000 | 1.289072997 × 2^15 | 0.000158495 |
| 23-bit precision | 40e4a0b11c33e320 | 1.289231405 × 2^15 | 0.000000087 |
| Double precision | 40e4a0b1337cdfbd | 1.289231492 × 2^15 | 0.0 |
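The hexadecimal words in Table 9 can be converted to the mantissa × 2^exponent form with a few lines of Python. The sketch assumes the standard double-precision field layout (1 sign bit, 11 exponent bits with a bias of 1023, 52 stored mantissa bits) and normalized values; the function name `decode` is hypothetical.

```python
def decode(word):
    """Split a 16-digit hexadecimal double into (sign, mantissa, exponent)."""
    bits = int(word, 16)
    sign = bits >> 63
    exponent = ((bits >> 52) & 0x7FF) - 1023               # remove the bias
    mantissa = 1 + (bits & ((1 << 52) - 1)) / 2 ** 52      # restore the hidden 1
    return sign, mantissa, exponent


print(decode("4069b130ae804118"))  # input operand
print(decode("40e4a0b1337cdfbd"))  # double-precision result

# Mantissa variation of the 23-bit mode relative to the double-precision result
variation = decode("40e4a0b1337cdfbd")[1] - decode("40e4a0b11c33e320")[1]
print(variation)
```

The input decodes to a mantissa of about 1.605759317 with exponent 7, and the double-precision result to about 1.289231492 with exponent 15, as listed in the table.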
Fig. 17 Variation of precision of multiplication results in different modes (plotted mantissa values: auto 1.289231492, 8-bit 1.288978577, 16-bit 1.289072997, 23-bit 1.289231405, double precision 1.289231492)
Figure 18 shows the reduction in area when using the run-time-reconfigurable
multi-precision floating-point multiplier compared to a conventional double-precision
floating-point multiplier. The area used in the lower precision modes is much
smaller, which drastically reduces the power consumption in those modes. So, if
there are no constraints on precision, the device can be operated in the lower
precision modes with very low power consumption.
The last stage in the design process is the implementation and testing of the complete
run-time-reconfigurable multi-precision floating-point matrix multiplier. The
complete implementation is done by implementing the Strassen algorithm and integrating
the proposed run-time-reconfigurable multi-precision floating-point multiplier in the
design. The top module of the processing element is shown in Fig. 19.
Fig. 18 Reduction in area while using run-time-reconfigurable multi-precision floating-point multiplier
when compared to conventional double-precision floating-point multiplier (bar chart of number of LUTs versus precision of floating-point units, from 8-bit to 52-bit, for the conventional DP FP multiplier and the run-time-reconfigurable FP multiplier)
Fig. 19 Top module of processing element
5 Conclusion and Future Scope
The results obtained from this work emphasize the importance of run-time
reconfigurability and variable-precision requirements in electronic design. The results
show that the delay and power consumption of the chip can be adjusted
according to different precision requirements. Also, the proposed design reduces
the percentage increase in area and delay with increasing word length of floating-point
multipliers by using an efficient combination of the Karatsuba algorithm and the Urdhva
Tiryagbhyam algorithm to implement the unsigned binary multiplier. Even though
the design is efficient in terms of area, power and delay in the run-time environment, its
overall area requirement is higher than that of conventional designs, because of the
parallelism applied to the model. Hence, much improvement is needed to
reduce the overall area requirement of the project. This design has broad applications
in image and signal processing, where it can be used in portable devices,
and in military applications, where saving battery power is very important.
References
1. A. Amira, A. Bouridane, P. Milligan, P. Sage, A high throughput FPGA implementation of a
bit-level matrix-matrix product, in Proceedings of the 43rd IEEE Midwest Symposium on Circuits and
Systems, (2000), pp. 396–399
2. N. Anane, H. Bessalah, M. Issad, K. Messaoudi, Hardware implementation of variable precision
multiplication on FPGA, in 4th International Conference on Design & Technology of Integrated Systems
in Nanoscale Era, (2009), pp. 77–81
3. R.K. Bathija, R.S. Meena, S. Sarkar, R. Sahu, Low power high speed 16×16 bit multiplier using Vedic
mathematics. Int. J. Comput. Appl. 59(6), 41–44 (2012)
4. F. Bensaali, A. Amira, A. Bouridane, An FPGA based coprocessor for large matrix product
implementation, in IEEE International Conference on Field-Programmable Technology (FPT), (2003), pp.
292–295
5. F. Bensaali, A. Amira, R. Sotudeh, Floating-point matrix product on FPGA, in IEEE/ACS International
Conference on Computer Systems and Applications (AICCSA), (2007), pp. 466–473
6. I. Bravo, P. Jiménez, M. Mazo, J.L. Lazaro, J.J. de las Heras, A. Gardel, Different proposals to matrix
multiplication based on FPGAs, in IEEE International Symposium on Industrial Electronics (ISIE),
(2007), pp. 1709–1714
7. C. Brunelli, P. Salmela, J. Takala, J. Nurmi, A flexible multiplier for media processing, in IEEE
Workshop on Signal processing System Design and Implementation, (2005), pp. 70–74
8. P. Corsonello, S. Perri, P. Zicari, A matrix product coprocessor for FPGA embedded soft processors,
in International Symposium on Signals, Circuits and Systems (ISSCS), (2005), pp. 489–492
9. H.S. Dhillon, A. Mitra, A reduced-bit multiplication algorithm for digital arithmetic. World Acad. Sci.
Eng. Technol. 19, 719–724 (2008)
10. http://en.wikipedia.org/wiki/Master_theorem
11. IEEE 754-2008, IEEE Standard for Floating-Point Arithmetic (2008)
12. A. Jain, B. Dash, A.K. Panda, M. Suresh, FPGA design of a fast 32-bit floating point multiplier unit,
in International Conference on Devices, Circuits and Systems (ICDCS), (2012), pp. 545–547
13. B. Jeevan, S. Narender, C.V. Krishna Reddy, K. Sivani, A high speed binary floating point multiplier
using Dadda algorithm, in International Multi-Conference on Automation, Computing,
Communication, Control and Compressed Sensing, (2013), pp. 455–460
14. T.-C. Lee, M. White, M. Gubody, Matrix multiplication on FPGA-based platform, in Proceedings of
the World Congress on Engineering and Computer Science (WCECS), vol. 1 (2013)
15. K. Manolopoulos, D. Reisis, V.A. Chouliaras, An efficient multiple precision floating-point multiplier,
in 18th IEEE International Conference on Electronics, Circuits and Systems (ICECS), (2011), pp.
153–156
16. A. Mehta, C.B. Bidhul, S. Joseph, P. Jayakrishnan, Implementation of single precision floating point
multiplier using Karatsuba algorithm, in 2013 International Conference on Green Computing,
Communication and Conservation of Energy (ICGCE), (2013), pp. 254–256
17. B. Parhami, Computer Arithmetic (Oxford University Press, Oxford, 2000)
18. M. Poornima, S.K. Patil, S.K. Shivukumar, H. Sanjay, Implementation of multiplier using Vedic
algorithm. Int. J. Innov. Technol. Explor. Eng. 2(6), 219–223 (2013), ISSN: 2278-3075
19. B.S. Premananda, S. Samarth, S.S. Pai, B. Shashank, S.S. Bhat, Design and implementation of 8-bit
Vedic multiplier. Int. J. Adv. Res. Electr. Electron. Instrum. Eng. 2(12), 5877–5882 (2013)
20. M.J. Rao, S. Dubey, A high speed and area efficient Booth recoded wallace tree multiplier for fast
arithmetic circuits, in 2012 Asia Pacific Conference on Postgraduate Research in Microelectronics &
Electronics (PRIMEASIA), (2012), pp. 220–223
21. D.N. Sonawane, M.S. Sutaone, I. Malek, Resource efficient 64-bit floating point matrix multiplication
algorithm using FPGA, in TENCON, (2009), pp. 1–5
22. R.S.S. Teja, A. Madhusudhan, FPGA implementation of low-area floating point multiplier using Vedic
mathematics. Int. J. Emerg. Technol. Adv. Eng. 3(12), 362–366 (2013), ISSN 2250-2459
23. Swami Sri Bharati Krsna Thirthaji Maharaja, Vedic Mathematics (Motilal Banarasidass Indological
Publishers and Book Sellers, 1965)
