Floating point arithmetic is widely used in many areas, especially scientific computation and signal processing. For many signal processing and graphics applications, it is acceptable to trade off some accuracy (in the least significant bit positions) for faster and better implementations. However, most of these modern applications need a higher operating frequency or lower latency with minimal area occupancy. In this paper we describe an implementation of a high speed IEEE 754 double precision floating point multiplier that uses the tiling technique and targets the Xilinx Virtex-6 Field Programmable Gate Array. Verilog is used to implement the design. The design achieves 436.815 MFlops with a latency of seven clock cycles, which is 97% faster than the Xilinx floating point multiplier core. It handles the overflow and underflow cases and the truncation rounding mode.
INTRODUCTION
In the majority of digital signal processing (DSP) applications the critical operation is multiplication. Floating point arithmetic is widely used in many areas, especially scientific computation and signal processing. The advantage of floating-point representation over fixed-point and integer representation is that it can support a much wider range of values. The greater dynamic range and the lack of any need to scale the numbers make the development of algorithms much easier. The IEEE has standardized the computer representation of binary floating-point numbers in IEEE 754. The IEEE floating point standard defines both single precision (32-bit) and double precision (64-bit) formats.
An IEEE Standard 754 compliant floating-point adder/multiplier can be implemented using field programmable gate arrays [1]. The use of FPGAs permits fast and accurate quantitative evaluation of a variety of circuit design tradeoffs for addition and multiplication. FPGAs also permit accurate assessment of the area and time costs associated with various features of the IEEE floating-point standard, including rounding and gradual underflow. That design was partitioned over four Actel A1280 FPGAs, with a 3-stage pipeline and a cycle time of 245 ns. Addition has a 3-cycle latency, while multiplication requires 6 cycles: 1 for the exponent stage, 4 for the significand stage, and 1 for the normalization stage. However, the multiplier latency was not reduced because a 24-bit multiplier is needed.
When single precision floating point arithmetic units are implemented on the Splash-2 architecture, the size of the units would increase by a factor of 2 to 4 over the 18-bit format. A multiply unit would require two Xilinx 4010 chips and an adder/subtractor unit broken up into four 12-bit multipliers, allocating two per chip. A 16x16 bit multiplier was the largest parallel integer multiplier that could fit into a Xilinx 4010 chip. When synthesized, this multiplier used 75% of the chip area [2].
Floating point operations are hard to implement on FPGAs because of the complexity of their algorithms. On the other hand, many scientific problems require floating point arithmetic with high levels of accuracy in their calculations. The FPGA implementations of addition and multiplication for IEEE single precision floating-point numbers trade off area and speed for accuracy. The adder is a bit-parallel adder, and the multiplier is a digit-serial multiplier. Prototypes have been implemented on Altera, and peak rates of 7 MFlops for 32-bit addition and 2.3 MFlops for 32-bit multiplication have been obtained [3].
A group of IEEE 754-style floating point units has been targeted at the Xilinx Virtex-II FPGA, taking advantage of special features of the technology to produce optimized components. The single-precision pipelined designs operate at 100 MHz [4].
High-precision floating-point applications on reconfigurable hardware require large multipliers [5]. Full multipliers are the core of floating-point multipliers. Embedded multipliers and adders in the DSP blocks of recent FPGAs are used for the automated generation of reconfigurable multipliers.
An efficient IEEE 754 single precision floating point multiplier has been implemented and targeted for the Xilinx Virtex-5 FPGA [6]. The multiplier handles the overflow and underflow cases, but rounding is not implemented. The design achieves 301 MFLOPs with a latency of three clock cycles. The multiplier was verified against the Xilinx floating point multiplier core.
The significand (also called the coefficient or mantissa) is the part of a floating-point number that contains its significant digits. Exponentiation is a mathematical operation, written as $a^n$, involving two numbers, the base a and the exponent (or power) n. When n is a positive integer, exponentiation corresponds to repeated multiplication. The double precision floating-point format is shown in Figure 1.
An 11-bit ripple carry adder is used to add the two input exponents. The black box view of the adder module (adder1) is shown in Figure 4.
The mantissas of operand A and operand B, together with the leading '1' (for normalized numbers), are stored in the 53-bit registers (mul_a) and (mul_b) respectively. Multiplying all 53 bits of mul_a by all 53 bits of mul_b results in a 106-bit product. Depending on the synthesis tool used, this multiply might be synthesized in different ways that do not take efficient advantage of the multiplier resources in the target device. 53-bit by 53-bit multipliers are not available in the most popular Xilinx and Altera FPGAs, so the multiply has to be broken down into smaller multiplies whose results are added together to give the final 106-bit product. Instead of relying on the synthesis tool to break down the multiply, which might result in a slow and inefficient layout of FPGA resources, the module (fpu_mul) breaks up the multiply into smaller 24-bit by 17-bit multiplies. The Xilinx Virtex-6 device contains DSP48E slices with 25 by 18 two's complement multipliers, each of which can perform a 24-bit by 17-bit unsigned multiply.
The products are added together, with the appropriate offsets based on which parts of the A and B arrays they multiply. For example, product_b is offset by 17 bits from product_a when product_a and product_b are added together. Similar offsets are used for the other products. The summation is accomplished by adding one product result to the previous product result instead of adding all products together in a single summation. The goal is to take advantage of the adders in the Virtex-6 DSP48E slices that follow each 24 by 17 multiply block.
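The decomposition described above can be checked with a short software model. The following Python sketch is an illustration only, not the Verilog of fpu_mul: it assumes that operand A is cut into 24-bit slices and operand B into 17-bit slices (the text does not state which operand gets which width), and the names `split` and `tiled_multiply` are hypothetical.

```python
WIDTH = 53      # significand width including the hidden leading 1
A_CHUNK = 24    # assumed slice width for operand A
B_CHUNK = 17    # assumed slice width for operand B


def split(value, chunk, width=WIDTH):
    """Cut `value` into (bit_offset, slice) pairs, least significant first."""
    parts = []
    for offset in range(0, width, chunk):
        bits = min(chunk, width - offset)  # the last slice may be narrower
        parts.append((offset, (value >> offset) & ((1 << bits) - 1)))
    return parts


def tiled_multiply(mul_a, mul_b):
    """Multiply two 53-bit significands using only 24x17-bit partial products."""
    total = 0
    for off_a, a_part in split(mul_a, A_CHUNK):
        for off_b, b_part in split(mul_b, B_CHUNK):
            product = a_part * b_part            # fits one 24x17 unsigned multiply
            total += product << (off_a + off_b)  # add to the running sum at its offset
    return total


# Two arbitrary 53-bit significands with the leading 1 set.
a = (1 << 52) | 0x123456789ABCD
b = (1 << 52) | 0xFEDCBA9876543
assert tiled_multiply(a, b) == a * b
```

Under these assumptions the 53 by 53 multiply needs 3 x 4 = 12 partial products, and two products that share the same slice of A but use adjacent slices of B differ in offset by exactly 17 bits, matching the product_a/product_b example.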
FLOATING POINT MULTIPLICATION ALGORITHM
The normalized floating point numbers have the form $Z = (-1)^{S} \times 2^{(E - \mathrm{Bias})} \times (1.M)$. To multiply two floating point numbers the following procedure is adopted.
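As a minimal sketch, the usual steps for multiplying two normalized operands (XOR the signs, add the exponents and subtract the bias, multiply the significands with their hidden 1, normalize, truncate) can be modeled in Python. This is a software illustration under those assumptions rather than the paper's Verilog; the function names `unpack` and `fp_multiply` are hypothetical, and zeros, denormals, infinities, NaNs, overflow and underflow are deliberately not handled.

```python
import struct

BIAS = 1023                 # exponent bias for double precision
FRAC_BITS = 52              # stored mantissa bits
FRAC_MASK = (1 << FRAC_BITS) - 1


def unpack(x):
    """Split a Python float into (sign, biased exponent, mantissa) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> FRAC_BITS) & 0x7FF, bits & FRAC_MASK


def fp_multiply(a, b):
    """Multiply two normalized doubles with truncation (no rounding)."""
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)

    sign = sa ^ sb                       # sign of the result
    exponent = ea + eb - BIAS            # add exponents, remove one bias
    # 53-bit x 53-bit significand product, up to 106 bits wide.
    product = ((1 << FRAC_BITS) | ma) * ((1 << FRAC_BITS) | mb)

    if product >> 105:                   # significand product is in [2, 4)
        product >>= 1                    # normalize back to [1, 2)
        exponent += 1

    fraction = (product >> FRAC_BITS) & FRAC_MASK   # drop low bits: truncation
    bits = (sign << 63) | (exponent << FRAC_BITS) | fraction
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```

For products that are exactly representable, such as `fp_multiply(1.5, 2.5)`, the model agrees with native multiplication. For inexact products the truncated result can be one unit in the last place smaller in magnitude than the round-to-nearest result that Python itself produces.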
MULTIPLYING THE MANTISSA BY USING TILING TECHNIQUE
The mantissa multiplier is built using the tiling technique. Let us consider multiplication operands A and B of p and q bits respectively. Multiplication of the multiplier (A) and the multiplicand (B) can be done by efficient use of the DSP blocks in FPGAs. The technique consists of tiling a p×q rectangular board using a minimal number of such multipliers. Starting from the tiled board, the circuit equation is obtained using a simple rewriting technique. Each sub-tile is multiplied and the products of all tiles are combined. Equation 1 is used to make full use of the Virtex-6 internal DSP adders. Due to the fixed 17-bit shifts between the operands, each of the sub-sums S0 and S1 may be computed entirely using DSP block resources. In this algorithm the number of adders required for adding the partial products is therefore reduced to three (i.e. the addition of S0, S1 and M0).
UNDERFLOW/OVERFLOW DETECTION
Overflow/underflow means that the result's exponent is too large/small to be represented in the exponent field. The exponent of the result must be 11 bits in size and must lie between 1 and 2046; otherwise the value is not a normalized one. An overflow may occur while adding the two exponents or during normalization. Overflow due to exponent addition may be compensated during subtraction of the bias, resulting in a normal output value (normal operation). An underflow may occur while subtracting the bias. When an overflow occurs, an overflow flag signal goes high and the result becomes ±Infinity (with the sign determined by the signs of the floating point multiplier inputs). When an underflow occurs, an underflow flag signal goes high and the result becomes ±Zero (with the sign determined by the signs of the floating point multiplier inputs). Denormalized numbers are flushed to zero with the appropriate sign calculated from the inputs, and an underflow flag is raised. Assuming that E1 and E2 are the exponents of the two numbers A and B respectively, the result's exponent is calculated using the equation $E_{\mathrm{result}} = E_1 + E_2 - \mathrm{Bias}$, where the bias is 1023 for double precision.
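The exponent check described above can be sketched in a few lines of Python. This is a hedged illustration, not the paper's circuit: the function name `classify_exponent`, the `normalize_shift` parameter (0 or 1, standing for the exponent increment applied during normalization) and the string labels are hypothetical. The exponent field values 2047 for infinity and 0 for zero follow the IEEE 754 double precision encoding.

```python
BIAS = 1023        # double precision exponent bias
EXP_MIN = 1        # smallest biased exponent of a normalized number
EXP_MAX = 2046     # largest biased exponent of a normalized number


def classify_exponent(e1, e2, normalize_shift=0):
    """Return (status, exponent_field) for the product of two normalized inputs.

    e1, e2          : biased 11-bit exponents of the operands
    normalize_shift : 1 if normalization shifted the significand right, else 0
    """
    exponent = e1 + e2 - BIAS + normalize_shift
    if exponent > EXP_MAX:
        return "overflow", 2047      # result is forced to +/-Infinity
    if exponent < EXP_MIN:
        return "underflow", 0        # result is forced to +/-Zero
    return "normal", exponent


# The raw sum 2000 + 1000 = 3000 does not fit in 11 bits, but subtracting
# the bias brings the result back into the normal range.
assert classify_exponent(2000, 1000) == ("normal", 1977)
```

Working on the wider intermediate sum before comparing against the limits is what allows an apparent overflow of the exponent addition to be compensated by the bias subtraction.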
FLOW CHART OF HIGH SPEED DOUBLE PRECISION FLOATING POINT MULTIPLIER
The flow chart of the high speed double precision floating point multiplier is shown in Figure 8.
SIMULATION RESULTS
The high speed double precision floating point multiplier design based on the tiling technique was simulated in ModelSim 6.6c and synthesized using Xilinx ISE 13.1i, which mapped it onto a Virtex-6 FPGA. The simulation results of the 64-bit high speed double precision floating point multiplier are shown in Figure 9. The signals 'a' and 'b' are the inputs and 'fpout' is the output. Table 2 shows the device utilization for implementing the circuit on the Virtex-6 FPGA.
CONCLUSION
The high speed double precision floating point multiplier supports the IEEE 754 binary interchange format and targets a Xilinx Virtex-6 xc6vlx75t-3ff484 FPGA. It achieves 436.815 MFLOPs, which is 30.9% and 97% faster than [6] and the Xilinx core respectively. The design occupies 433 slices, which is 28% fewer than [6] and 38.6% more than the Xilinx core. It uses 197 flip flops, i.e. 32.7% and 18% fewer than [6] and the Xilinx core respectively. The design handles overflow, underflow, and the truncation rounding mode.
