I. INTRODUCTION
Floating point multiplication units are essential Intellectual Property (IP) blocks for modern multimedia and high performance computing applications such as graphics acceleration, signal processing and image processing. Considerable effort has been made over the past few decades to improve the performance of floating point computations. Floating point units are not only complex, but also require more area and hence consume more power than fixed point multipliers. The complexity of the floating point unit increases further when accuracy becomes a major issue, since even a minute error in accuracy can have major consequences. Such errors arise in floating point units mainly because of the discrete nature of the IEEE-754 [1] floating point representation, in which a fixed number of bits is used to represent numbers. Scientific applications such as computational geometry, climate modeling and computational physics have high computational requirements and need extreme precision in floating point calculations. This precision may not be available in the single precision or double precision formats, which further increases the complexity of the unit. Other applications, however, do not require high precision, and an approximate value is sufficient for correct operation. For such applications, a double precision or quadruple precision floating point unit is a luxury: it wastes area and power and increases latency.
In portable or wearable devices, where the accuracy requirement varies between applications and power consumption is a very important factor, a high precision floating point multiplier is not a good option. In such cases a variable precision multiplier is a better choice, since it saves power and time when the application does not need high precision. Several such designs have been proposed [2], [3], [4]. Most of them make use of already available IPs such as DSP (Digital Signal Processing) units and 18x18 multiplier units. In this paper, we present a power efficient design of a floating point multiplier with different modes of accuracy selection, so that the mode appropriate for the application at hand can be selected. As the accuracy requirement decreases, the width of the multiplier decreases, and with it the power consumption and latency.
II. PROPOSED MODEL
The proposed model is a reconfigurable multi-precision floating point multiplier which can be operated in six different modes according to the accuracy requirements. It can perform floating point multiplication with different mantissa sizes depending on the precision requirement. The basic unit is a double-precision floating point unit, and the size of the mantissa is varied according to the precision selected. Fig. 1 shows the floating-point multiplication format used in the proposed model.
The multiplier accepts two inputs, each 67 bits wide. The three most significant bits (bits 66 to 64) are the mode select bits, and the remaining 64 bits carry the operand in double-precision floating point format.
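The input format described in this section can be illustrated with a small software model. This is only a sketch: the function names `unpack_operand` and `check_modes` are hypothetical, the field split assumes the standard IEEE-754 double-precision layout (1 sign bit, 11 exponent bits, 52 fraction bits) for the 64-bit payload, and the narrower mantissa fields of the custom precision modes are not modeled because their widths are not given here.

```python
MODE_BITS = 3       # bits 66..64 carry the mode select value
PAYLOAD_BITS = 64   # bits 63..0 carry a double-precision word (assumed layout)


def unpack_operand(word):
    """Split a 67-bit operand into mode select bits and double-precision fields."""
    if not 0 <= word < (1 << (MODE_BITS + PAYLOAD_BITS)):
        raise ValueError("operand must fit in 67 bits")
    mode = word >> PAYLOAD_BITS                  # bits 66..64
    payload = word & ((1 << PAYLOAD_BITS) - 1)   # bits 63..0
    return {
        "mode": mode,
        "sign": payload >> 63,                   # bit 63
        "exponent": (payload >> 52) & 0x7FF,     # bits 62..52, biased
        "fraction": payload & ((1 << 52) - 1),   # bits 51..0
    }


def check_modes(word_a, word_b):
    """Return the common mode, or raise if the two operands disagree."""
    mode_a = unpack_operand(word_a)["mode"]
    mode_b = unpack_operand(word_b)["mode"]
    if mode_a != mode_b:
        # Software stand-in for the hardware mode select error signal.
        raise ValueError("mode select error")
    return mode_a
```

For example, the double-precision value 1.5 (bit pattern `0x3FF8000000000000`) tagged with mode `0b101` unpacks to sign 0, biased exponent 1023 and a fraction with only its top bit set.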
The value of the mode select bits must be the same for both the inputs, otherwise a mode select error is generated and the execution will be stopped. The mode select bit combinations for the different modes [...]
[...] multiplication which require an [...] using 8-bit and 16-bit multipliers [...] double precision floating point [...] power and can increase the speed. The binary multiplier used for mantissa multiplication is designed by using a combination of the Karatsuba and Urdhva-Tiryagbhyam [6] algorithms, for optimization in terms of speed [...]
[Figure: diagram of the proposed model]
III. FLOATING POINT MULTIPLIER
A floating point number is represented in IEEE-754 format [1] as $\pm s \times b^{e}$, or $\pm\,\text{significand} \times \text{base}^{\text{exponent}}$ [7]. To perform multiplication of two floating point numbers $\pm s_1 \times b^{e_1}$ and $\pm s_2 \times b^{e_2}$, the significand or mantissa parts are multiplied to get the product mantissa and the exponents are added to get the product exponent, i.e., the product is $\pm (s_1 \times s_2) \times b^{e_1 + e_2}$. The hardware block diagram of the floating point multiplier is shown in Fig. 3. The important blocks in the implementation of the proposed floating point multiplier are described below [8].
A. Sign Calculation
The MSB of a floating point number represents the sign bit. The sign of the product will be positive if both the numbers are of the same sign and negative if the numbers are of opposite sign. So, to obtain the sign of the product, we can use a simple XOR gate as the sign calculator.
B. Addition of Exponents
To get the product exponent, the input exponents are added together. Since a bias is used in the floating point format exponent, we need to subtract the bias from the sum of exponents to get the actual exponent. The value of the bias is 127 ($01111111_2$) for single precision format and 1023 ($01111111111_2$) for double precision format. In the proposed custom precision format also, a bias of 127 is used. The computational time of the mantissa multiplication operation is much more than that of the exponent addition, so a simple ripple carry adder and ripple borrow subtractor is optimal for exponent addition.
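The sign and exponent paths can be sketched in a few lines of software. The function names are hypothetical, and the model works on integer field values rather than on gate-level adders. Each stored exponent already contains one bias, so the sum of two stored exponents contains the bias twice; subtracting it once leaves a correctly biased product exponent (before any normalization adjustment).

```python
def product_sign(sign_a, sign_b):
    """Sign bit of the product: XOR of the operand sign bits."""
    return sign_a ^ sign_b


def product_exponent(exp_a, exp_b, bias=127):
    """Biased exponent of the product, before normalization.

    exp_a and exp_b are stored (biased) exponents. Their sum carries the
    bias twice, so one bias is subtracted.
    """
    return exp_a + exp_b - bias
```

As a check in single precision, 2.0 has stored exponent 128 and 4.0 has stored exponent 129; the sketch gives 128 + 129 - 127 = 130, which is the stored exponent of 8.0.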
C. Karatsuba-Urdhva Tiryagbhyam binary multiplier
In floating point multiplication, the most complex part is the mantissa multiplication. Multiplication requires more time compared to addition, and as the number of bits increases, it consumes more area and time. In double precision format, we need a 53x53 bit multiplier and in single precision format we need a 24x24 bit multiplier. It requires much time to perform these operations, and this is the major contributor to the delay of the floating point multiplier. To make the multiplication operation faster, the proposed model uses a combination of the Karatsuba algorithm and the Urdhva Tiryagbhyam algorithm.
The Karatsuba algorithm uses a divide and conquer approach, in which it breaks down the inputs into Most Significant half and Least Significant half, and this process continues until the operands are 8 bits wide. The Karatsuba algorithm is best suited for operands of higher bit length, but at lower bit lengths it is not as efficient. To overcome this problem, the Urdhva Tiryagbhyam algorithm is used at the lower stages. The model of the Urdhva-Tiryagbhyam algorithm is shown in Fig. 4. The Urdhva Tiryagbhyam algorithm is the best algorithm for binary multiplication in terms of area and delay. But as the number of bits increases, the delay also increases, as the partial products are added in a ripple manner. For example, 4-bit multiplication requires 6 adders connected in a ripple manner, and 8-bit multiplication requires 14 adders. Compensating for the delay will cause an increase in area, so the Urdhva Tiryagbhyam algorithm is not optimal if the number of bits is large. If we use the Karatsuba algorithm at the higher stages and the Urdhva Tiryagbhyam algorithm at the lower stages, it can somewhat compensate for the limitations of both algorithms, and hence the multiplier becomes more efficient. The circuit is further optimized by using carry select and carry save adders instead of ripple carry adders. This reduces the delay to a great extent with a minimal increase in hardware. These two algorithms are explained in detail in the following sections.
[...] operands in adders 2 to 5, we implement [...] adders 2 to 5. This [...] a great extent compared to the [...]
The Karatsuba algorithm [4, 5] is best suited for multiplying large numbers. This method was discovered by Anatoly Karatsuba. It is a divide and conquer method, in which the numbers are divided into their Most Significant half and Least Significant half and then multiplication is performed. The Karatsuba algorithm reduces the number of multipliers required by replacing multiplication operations by addition operations. Additions are faster than multiplications, and hence the speed of the multiplier is increased. As the number of bits increases, the Karatsuba algorithm becomes more efficient. This algorithm is optimal if the width of [...]. The hardware architecture of the Karatsuba algorithm is shown in Fig. 7.
[Figure: diagram of the Karatsuba multiplier]
The product of the inputs X and Y can be explained as follows:
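$$X = 2^{n/2} X_l + X_r, \qquad Y = 2^{n/2} Y_l + Y_r$$
where $X_l$, $Y_l$ and $X_r$, $Y_r$ are the Most Significant half and Least Significant half of $X$ and $Y$ respectively, and $n$ is the number of bits. Then,
$$XY = 2^{n} X_l Y_l + 2^{n/2}\,(X_l Y_r + X_r Y_l) + X_r Y_r \tag{3}$$
The second term in equation (3) can be optimized to reduce the number of multiplication operations, i.e.,
$$X_l Y_r + X_r Y_l = (X_l + X_r)(Y_l + Y_r) - X_l Y_l - X_r Y_r \tag{4}$$
Equation (3) can then be re-written as
$$XY = 2^{n} X_l Y_l + 2^{n/2}\,\big[(X_l + X_r)(Y_l + Y_r) - X_l Y_l - X_r Y_r\big] + X_r Y_r$$
The recurrence of the Karatsuba algorithm is
$$T(n) = 3\,T(n/2) + O(n) \approx O(n^{1.585})$$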
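The combination of the two algorithms can be sketched as a software model: Karatsuba splitting at the higher stages, with an 8-bit Urdhva Tiryagbhyam (vertically and crosswise) multiplier as the base case. This is an illustration only, not the hardware design. The names `urdhva_multiply`, `karatsuba_multiply` and `mantissa_multiply` are hypothetical, and two details are assumptions of this sketch: operand widths are padded up to 8 times a power of two (for example 53 bits to 64), and the carry bits of the sums $X_l + X_r$ and $Y_l + Y_r$ are handled separately so that every recursive call sees operands of exactly half the width.

```python
BASE_WIDTH = 8  # operand width at which Karatsuba hands over to Urdhva


def urdhva_multiply(x, y, width=BASE_WIDTH):
    """Vertically-and-crosswise product of two width-bit operands."""
    xb = [(x >> i) & 1 for i in range(width)]
    yb = [(y >> i) & 1 for i in range(width)]
    result, carry = 0, 0
    for col in range(2 * width - 1):
        # Column sum: every cross product x_i * y_j with i + j == col,
        # plus the carry from the previous column.
        lo = max(0, col - width + 1)
        hi = min(col, width - 1)
        total = carry + sum(xb[i] & yb[col - i] for i in range(lo, hi + 1))
        result |= (total & 1) << col
        carry = total >> 1
    return result | (carry << (2 * width - 1))


def karatsuba_multiply(x, y, n):
    """Product of two n-bit operands, n = BASE_WIDTH * 2**k."""
    if n <= BASE_WIDTH:
        return urdhva_multiply(x, y, BASE_WIDTH)
    half = n // 2
    mask = (1 << half) - 1
    xl, xr = x >> half, x & mask
    yl, yr = y >> half, y & mask
    high = karatsuba_multiply(xl, yl, half)   # X_l * Y_l
    low = karatsuba_multiply(xr, yr, half)    # X_r * Y_r
    # (X_l + X_r) and (Y_l + Y_r) may be half + 1 bits wide: split off the
    # carry bits so the recursive call still gets half-bit operands.
    sx, sy = xl + xr, yl + yr
    cx, cy = sx >> half, sy >> half
    a, b = sx & mask, sy & mask
    mid = karatsuba_multiply(a, b, half)
    mid += ((b if cx else 0) + (a if cy else 0)) << half
    mid += (cx & cy) << (2 * half)
    cross = mid - high - low                  # X_l*Y_r + X_r*Y_l, as in (4)
    return (high << n) + (cross << half) + low


def mantissa_multiply(x, y, bits):
    """Pad the width up to BASE_WIDTH * 2**k and multiply."""
    n = BASE_WIDTH
    while n < bits:
        n *= 2
    return karatsuba_multiply(x, y, n)
```

Each level performs three half-width multiplications instead of four, which is where the factor 3 in the recurrence comes from; for a 53-bit mantissa the sketch pads to 64 bits and recurses through 32 and 16 bits down to the 8-bit base case.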
D. Normalization of the result
Floating point representations have a hidden bit in the mantissa, which always has the value 1 and is therefore not stored, saving one bit. The hidden bit is the leading 1 of the mantissa, i.e., the 1 immediately to the left of the binary point. Normalization is usually done by shifting, so that the MSB of the mantissa becomes nonzero, which in radix 2 means 1. If the leading 1 of the mantissa multiplication result is not immediately to the left of the binary point, the binary point is shifted left, and for each such shift the exponent value is incremented by one. This is called normalization of the result. Since the value of the hidden bit is always 1, it is called the 'hidden 1'.
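A minimal sketch of this normalization step, assuming both operands are normalized significands of the form 1.f with the hidden 1 made explicit (24 bits for single precision). Under that assumption each operand lies in [1, 2), the product lies in [1, 4), and at most one shift is needed. The function name is hypothetical, and the low-order product bits are simply truncated, since no rounding scheme is described here.

```python
def normalize_product(product, exponent, frac_bits=23):
    """Normalize the product of two (frac_bits + 1)-bit significands 1.f.

    Returns (fraction, exponent): the stored fraction field without the
    hidden 1, and the adjusted exponent.
    """
    frac_mask = (1 << frac_bits) - 1
    if product >> (2 * frac_bits + 1):
        # Product is in [2, 4): the leading 1 is two places left of the
        # binary point, so move the point left once and bump the exponent.
        exponent += 1
        fraction = (product >> (frac_bits + 1)) & frac_mask
    else:
        # Product is in [1, 2): already normalized.
        fraction = (product >> frac_bits) & frac_mask
    return fraction, exponent
```

For example, 1.5 x 1.5 = 2.25 needs one shift and becomes 1.125 x 2^1, while 1.0 x 1.5 = 1.5 is already normalized.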
E. Representation of exceptions
Some numbers cannot be represented with a normalized significand, and special codes are assigned to represent them. In the proposed model, we use four output signals, namely Zero, Infinity, NaN (Not-a-Number) and Denormal, to represent these exceptions. If the product has a biased exponent of 0 and a significand of 0, then the result is taken as Zero (±0). If the product has a biased exponent of 255 and a significand of 0, then the result is taken as Infinity (∞).
If the product has a biased exponent of 255 and a nonzero significand, then the result is taken as NaN. Denormalized values or Denormals are numbers without a hidden 1 and with the smallest possible exponent. They are used to represent certain small numbers that cannot be represented as normalized numbers. If the product has a biased exponent of 0 and a nonzero significand, then the result is represented as Denormal. A Denormal is represented as $\pm 0.s \times 2^{-126}$, where $s$ is the significand.
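The four exception cases can be summarized as a small classification routine. This sketch assumes the single precision encoding (8-bit biased exponent, 23-bit stored fraction); the helper names are hypothetical, and the "Normal" return value is added here only to cover the non-exception case.

```python
import struct


def classify(exponent, fraction, exp_bits=8):
    """Map a (biased exponent, stored fraction) pair to an exception flag."""
    exp_max = (1 << exp_bits) - 1   # 255 for an 8-bit exponent field
    if exponent == 0:
        return "Zero" if fraction == 0 else "Denormal"
    if exponent == exp_max:
        return "Infinity" if fraction == 0 else "NaN"
    return "Normal"


def single_fields(value):
    """Biased exponent and stored fraction of value as IEEE-754 single."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return (bits >> 23) & 0xFF, bits & 0x7FFFFF
```

A quick check is to feed `classify(*single_fields(v))` with `0.0`, `float("inf")`, `float("nan")` and the smallest single precision denormal `2.0 ** -149`.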
IV. IMPLEMENTATION AND RESULTS
The main objective of this work is to design and implement a variable-precision floating point circuit that can reconfigure itself according to the precision requirements, operate at high speed irrespective of accuracy, and consume less power where accuracy is not an issue. Since mantissa multiplication is the most complex part of the floating point multiplier, we designed a multiplier which operates at high speed, with a significantly smaller increase in delay and area as the number of bits increases. The floating point multipliers of the different modes, in IEEE-754 standard format and in custom precision format, were implemented separately using Verilog HDL and tested. The binary multiplier unit (Karatsuba-Urdhva) was further optimized by replacing simple adders with efficient adders such as carry select adders and carry save adders. The proposed model was implemented, synthesized and simulated using Xilinx Synthesis Tools (ISE 14.7), targeting the Virtex-4 family. The model operates in the selected mode only: during operation, only the selected multiplier unit is in the ON state and all other multiplier units are in the OFF state. Hence, if a low precision mode is selected, the area and hence the power consumption will be less. The summary of results is given in Tables II and III. Comparison with various multiplier units is given in Tables IV, V, VI, VII and VIII.
This paper describes a method to effectively adjust the delay and power consumption for different accuracy requirements. It also shows how to reduce the percentage increase in delay and area of a floating point multiplier with increasing number of bits by using an efficient combination of the Karatsuba and Urdhva-Tiryagbhyam algorithms. The model can be further optimized in terms of delay by using pipelining methods, and the precision of the result can be increased by adding efficient truncation and rounding methods.
