Abstract-The need for wireless communication has driven communication systems toward high performance. However, the main bottleneck that limits communication capability is the Fast Fourier Transform (FFT), which is the core of most modulators. This paper presents an FPGA implementation of a pipelined digit-slicing multiplier-less radix-2² DIF (Decimation In Frequency) SDF (Single-path Delay Feedback) butterfly for the FFT structure. To reduce the computational complexity of the butterfly multiplier, the digit-slicing multiplier-less technique was applied in the critical path of the pipelined radix-2² DIF SDF FFT structure. The proposed design focuses on the trade-off between speed and active silicon area for chip implementation. The multiplier input data was sliced into four blocks of four bits each, which are processed in parallel. The new architecture was investigated and simulated in MATLAB. Verilog HDL code describing the FFT butterfly functionality was developed in the Xilinx ISE environment and downloaded to a Virtex-II FG456 Proto board, which was used to implement and test the design on real hardware. The synthesis report indicates a maximum clock frequency of 555.75 MHz with a total equivalent gate count of 32,146, a significant improvement over the conventional radix-2² DIF SDF FFT butterfly. By comparison, the conventional butterfly architecture runs at a maximum clock frequency of only 200.102 MHz and the conventional multiplier at a maximum clock frequency of only 221.140 MHz. It can be concluded that on-chip implementation of the pipelined digit-slicing multiplier-less butterfly for the FFT structure is an enabler in solving problems that affect communication capability in FFT and has considerable potential for future related work and research.
I. INTRODUCTION
The FFT is a significant block in several digital signal processing (DSP) applications such as biomedical, sonar, communication systems, radar, and image processing. It is an efficient algorithm for computing the discrete Fourier transform (DFT), which is a central procedure in data analysis, system design, and implementation [1]. Many modules have been designed and implemented on different platforms in order to reduce the computational complexity of the FFT algorithm. These modules focus on the radix order or the twiddle factors to obtain a simple and efficient algorithm, and include the higher-radix FFT [2], the mixed-radix FFT [3], the prime-factor FFT [4], the recursive FFT [5], the low-memory-reference FFT [6], multiplier-less FFTs [7, 8], and Application-Specific Integrated Circuit (ASIC) systems [9, 10]. A special class of FFT architecture that computes the FFT in a sequential manner is the pipeline FFT. Pipelined architectures are characterized by real-time, non-stop processing and offer small latency with low power consumption [11], which makes them suitable for most DSP applications. There are two common types of pipelined architecture: single-path and multi-path. Several different architectures have been investigated, such as the Radix-2 Multi-path Delay Commutator (R2MDC) [12], the Radix-2 Single-path Delay Feedback (R2SDF) [13], the Radix-4 Single-path Delay Commutator (R4SDC) [14], and the Radix-2² Single-path Delay Feedback (R2²SDF) [15]. A study of these architectures shows that the delay-feedback architecture is more efficient than the delay-commutator architectures in terms of memory utilization. Radix-2² has the same simple butterfly as radix-2 and the same multiplicative complexity as the radix-4 algorithm [16, 17]. This makes the radix-2² single-path delay feedback an attractive architecture for DSP implementation. The digit-slicing technique has been studied for digital filters in [18] [19] [20].
The design and implementation of a digit-slicing FFT has been discussed in [21]. This paper proposes an idea similar to that of [21], but uses a different algorithm, structure, and platform, which helps to improve performance and achieve a higher clock frequency. Recently, the Field Programmable Gate Array (FPGA) has become a viable option for direct hardware implementation in real-time applications. In this paper, a digit-slicing architecture is proposed for the design of the pipelined digit-slicing multiplier-less radix-2² SDF butterfly. The butterfly multiplication is the most critical contributor to the delay in the computation of the FFT. Because the twiddle factors in the FFT processor are known in advance, we propose to replace the traditional butterfly in the FFT with the pipelined digit-slicing multiplier-less butterfly.
II. RADIX-2² SDF FFT ALGORITHM
Delay feedback is the more efficient architecture in terms of memory utilization. Single-path architectures based on the radix-4 algorithm have higher multiplier utilization; however, architectures based on the radix-2 algorithm have simpler butterflies and control logic. The radix-2² FFT algorithm has the same multiplicative complexity as radix-4 but retains the butterfly structure of the radix-2 algorithm [15]. That makes the R2²SDF FFT algorithm the best choice for VLSI implementation. In this algorithm, the first two steps of the decomposition of the radix-2 DIF FFT are analysed, and the common factor algorithm is used to illustrate it.
The twiddle factor is W_N = e^(-j2π/N). In Equation (1), the indices n and k are decomposed as:
The indices n and k each take N values in total. When the above substitutions are applied to Equation (1), the DFT definition can be written as: Where:
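For reference, a sketch of the DFT definition, the index decomposition, and the resulting form of Equation (5), written in the standard notation of the radix-2² derivation in [15]. The symbol names (the three-part index split n_1, n_2, n_3 and k_1, k_2, k_3, and the butterfly term B) are assumed from that derivation rather than taken from this paper's own equations:

$$
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N, \qquad W_N = e^{-j 2\pi / N}
$$

$$
n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad
k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N, \qquad
n_1, n_2, k_1, k_2 \in \{0, 1\}, \quad n_3, k_3 \in \{0, \dots, \tfrac{N}{4} - 1\}
$$

$$
X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1}
\left\{ B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) \right\}
W_N^{\left(\frac{N}{4} n_2 + n_3\right)\left(k_1 + 2k_2 + 4k_3\right)}
$$

$$
B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) =
x\!\left(\tfrac{N}{4} n_2 + n_3\right) + (-1)^{k_1}\, x\!\left(\tfrac{N}{4} n_2 + n_3 + \tfrac{N}{2}\right)
$$

The braces enclose the sum over n_1, which is the first-stage radix-2 butterfly; the factor (-1)^{k_1} comes from W_N^{(N/2) n_1 k} = (-1)^{n_1 k_1}.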
For normal radix-2 DIF FFT algorithms, the expression in the braces in Equation (5) is computed first, as the first stage. In the radix-2² FFT algorithm, however, the main idea is to reconstruct the first-stage and second-stage twiddle factors, as shown in Equation (8) and described in [15].
Observe that the last twiddle factor in Equation (8) can be rewritten as:
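In the notation of [15], with n = (N/2)n_1 + (N/4)n_2 + n_3 and k = k_1 + 2k_2 + 4k_3 (index names that are an assumption with respect to this paper's own symbols), the twiddle-factor decomposition and the rewriting of its last factor can be sketched as:

$$
W_N^{\left(\frac{N}{4} n_2 + n_3\right)\left(k_1 + 2k_2 + 4k_3\right)}
= W_N^{N n_2 k_3}\; W_N^{\frac{N}{4} n_2 (k_1 + 2k_2)}\; W_N^{n_3 (k_1 + 2k_2)}\; W_N^{4 n_3 k_3}
= (-j)^{n_2 (k_1 + 2k_2)}\; W_N^{n_3 (k_1 + 2k_2)}\; W_N^{4 n_3 k_3}
$$

$$
W_N^{4 n_3 k_3} = e^{-j \frac{2\pi}{N} 4 n_3 k_3} = e^{-j \frac{2\pi}{N/4} n_3 k_3} = W_{N/4}^{n_3 k_3}
$$

The first factor drops out because W_N^N = 1, and W_N^{N/4} = -j turns the second factor into a trivial multiplication.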
Applying Equations (8) and (9) to Equation (5) and expanding the summation over n_2 yields a DFT definition with a four times shorter FFT length.
Where,
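In the derivation of [15], the quantity defined at this point is the term H(k1, k2, n3) = [x(n3) + (-1)^k1 x(n3 + N/2)] + (-j)^(k1 + 2 k2) [x(n3 + N/4) + (-1)^k1 x(n3 + 3N/4)], and the shortened DFT is the sum over n3 of H multiplied by W_N^(n3 (k1 + 2 k2)) and W_(N/4)^(n3 k3). The Python sketch below assumes that form and checks it numerically against a direct DFT; the function names are illustrative and not part of the paper.

```python
import cmath

def dft(x):
    """Direct N-point DFT, used only as the reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def twiddle(modulus, exponent):
    """W_modulus^exponent = exp(-j*2*pi*exponent/modulus)."""
    return cmath.exp(-2j * cmath.pi * exponent / modulus)

def radix22_dft(x):
    """DFT evaluated through the radix-2^2 decomposition (assumed form from [15])."""
    n_pts = len(x)
    assert n_pts % 4 == 0, "length must be a multiple of 4"
    quarter = n_pts // 4
    out = [0j] * n_pts
    for k1 in (0, 1):
        for k2 in (0, 1):
            for k3 in range(quarter):
                acc = 0j
                for n3 in range(quarter):
                    # Butterfly I terms: pairs of samples N/2 apart
                    bf_a = x[n3] + (-1) ** k1 * x[n3 + 2 * quarter]
                    bf_b = x[n3 + quarter] + (-1) ** k1 * x[n3 + 3 * quarter]
                    # Butterfly II: combine with the trivial factor (-j)^(k1 + 2*k2)
                    h = bf_a + (-1j) ** (k1 + 2 * k2) * bf_b
                    # Remaining non-trivial twiddle, then the N/4-point DFT kernel
                    acc += h * twiddle(n_pts, n3 * (k1 + 2 * k2)) * twiddle(quarter, n3 * k3)
                out[k1 + 2 * k2 + 4 * k3] = acc
    return out
```

Only the factor W_N^(n3 (k1 + 2 k2)) requires a real complex multiplier; the powers of -1 and -j are the sign changes and swaps performed inside Butterfly I and Butterfly II.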
Each term in Equation (10) represents a radix-2 butterfly (Butterfly I), while the whole equation represents a radix-2 butterfly (Butterfly II) with trivial multiplication by (-j). Equation (10) is known as the radix-2² SDF FFT algorithm. Fig. 1 shows the butterfly signal flow graph for the radix-2² FFT algorithm. Fig. 2 shows the 16-point R2²SDF FFT signal flow graph.
Fig. 1 The butterfly structure for the radix-2² DIF FFT
A. Butterfly I Structure
Fig. 4 shows the Butterfly I structure. The inputs A_r, A_i of this butterfly come from the previous component, which is the twiddle factor multiplier, except in the first stage, where they come from the FFT input data. The output data B_r, B_i go to the next stage, which is normally Butterfly II. The control signal C1 has two settings: with C1=0, the multiplexers direct the input data to the feedback registers until they are filled; with C1=1, the multiplexers select the outputs of the adders and subtracters.
Butterfly I stores the first half of the N-point input sequence in the feedback registers, then performs the butterfly calculation as the second half of the data arrives. The results of the butterfly are B_r, B_i, D_r, and D_i: B_r and B_i are fed to the output of Butterfly I, while D_r and D_i go to the feedback registers.
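The behaviour described above can be sketched as a cycle-by-cycle Python model. This is a minimal sketch under stated assumptions: the output is taken to be the sum and the fed-back value the difference, the registers are assumed to drain to the output while C1=0 (as in the usual SDF arrangement), and no scaling is applied. The function name and the `half` parameter are illustrative.

```python
from collections import deque

def butterfly1_stream(samples, half):
    """Model one SDF Butterfly I stage with `half` feedback registers.

    `samples` is the input stream (real or complex); one output is
    produced per input sample.
    """
    fifo = deque([0] * half)           # feedback registers, initially cleared
    out = []
    for t, a in enumerate(samples):
        c1 = (t // half) % 2           # 0: fill registers, 1: compute butterfly
        stored = fifo.popleft()        # oldest register value leaves the delay line
        if c1 == 0:
            out.append(stored)         # previously stored D values drain to the output
            fifo.append(a)             # first-half sample goes into the registers
        else:
            out.append(stored + a)     # B = first half + second half -> output
            fifo.append(stored - a)    # D = first half - second half -> feedback
    return out
```

For a 4-point block followed by two flushing zeros, `butterfly1_stream([1, 2, 3, 4, 0, 0], 2)` returns `[0, 0, 4, 6, -2, -2]`: the sums appear while the second half arrives, and the differences follow one half-block later.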
B. Butterfly II Structure
Fig. 5 shows the Butterfly II structure. The input data B_r, B_i come from the previous component, Butterfly I. The output data from Butterfly II are E_r, E_i, F_r, and F_i. E_r and E_i are fed to the next component, normally the twiddle factor multiplier, while F_r and F_i go to the feedback registers.
The multiplication by -j involves swapping the real and imaginary parts and a sign inversion. The swapping is handled efficiently by the Swap-MUX multiplexers, and the sign inversion is handled by switching between the adding and subtracting operations by means of the Swap-MUX. The control signals C1 and C2 are both one when multiplication by −j is needed; the real and imaginary data are then swapped and the adding and subtracting operations are switched.
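As a small illustration of why no multiplier is needed for this step, the Python sketch below uses the identity (b_r + j b_i)(-j) = b_i - j b_r and folds the sign inversion into the choice between addition and subtraction. The function name and the `rotate` flag (standing in for the case where both control signals are one) are assumptions made for illustration.

```python
def butterfly2(a_re, a_im, b_re, b_im, rotate):
    """Return (a + b', a - b') as (re, im) pairs, where b' = -j*b if rotate else b."""
    if rotate:
        b_re, b_im = b_im, b_re                # swap real and imaginary parts
        total = (a_re + b_re, a_im - b_im)     # add/subtract roles exchanged on
        diff = (a_re - b_re, a_im + b_im)      # the imaginary path (sign inversion)
    else:
        total = (a_re + b_re, a_im + b_im)
        diff = (a_re - b_re, a_im - b_im)
    return total, diff
```

With a = 3 + 4j and b = 1 + 2j, -j*b = 2 - j, so `butterfly2(3, 4, 1, 2, True)` returns `((5, 3), (1, 5))`.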
To avoid losing precision, a divide-by-2 is applied where the word length would otherwise grow successively as the data pass through the adder, subtracter, and multiplier operations. Rounding is also applied to reduce the scaling errors.
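A minimal Python sketch of such a scaling step on 16-bit integers is given below. The rounding mode (round half up, implemented as add one then arithmetic shift right) is an assumption; the paper does not state which rounding is used, and the function names are illustrative.

```python
def halve_rounded(value):
    """Divide a signed integer by 2, rounding halves toward +infinity."""
    return (value + 1) >> 1            # Python's >> is an arithmetic shift

def scaled_add(a, b):
    """Add two 16-bit two's complement values and scale the sum back to 16 bits."""
    return halve_rounded(a + b)        # the 17-bit sum is halved to avoid word growth
```

Under this assumption the scaled sum of any two 16-bit values stays within the range -32768 to 32767, so the word length does not grow from stage to stage.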
C. Complex Multiplier
Normally the complex multiplier is realized with four real multipliers, one adder, and one subtractor, as shown in Fig. 6. This structure occupies a large chip area in VLSI implementation. The complex multiplier can instead be realized with only three real multipliers and five real adders/subtractors based on Equation (13), which saves considerable area in hardware implementation, as shown in Fig. 7.
(a_r + ja_i)(b_r + jb_i) = {b_r(a_r - a
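One standard three-multiplier arrangement that is consistent with the operation count quoted above (three real multiplications, five real additions/subtractions) is sketched below in Python. Whether it matches Equation (13) term by term is an assumption, and the function name is illustrative.

```python
def cmul3(a_re, a_im, b_re, b_im):
    """Complex product (a_re + j*a_im)(b_re + j*b_im) with three real multiplications."""
    m1 = b_re * (a_re - a_im)          # multiplier 1, pre-subtraction 1
    m2 = a_im * (b_re - b_im)          # multiplier 2, pre-subtraction 2 (shared term)
    m3 = b_im * (a_re + a_im)          # multiplier 3, pre-addition 3
    real = m1 + m2                     # = a_re*b_re - a_im*b_im
    imag = m3 + m2                     # = a_re*b_im + a_im*b_re
    return real, imag
```

For example, `cmul3(3, 4, 1, 2)` returns `(-5, 10)`, matching (3 + 4j)(1 + 2j) = -5 + 10j. The shared product `m2` is what removes the fourth multiplier at the cost of three extra additions.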
IV. FPGA IMPLEMENTATION OF PIPELINE DIGIT-SLICING MULTIPLIER-LESS RADIX-2² DIF SDF BUTTERFLY
The previous section explained in detail the conventional structure of the R2²SDF butterfly. This section discusses how to apply the digit-slicing technique to the R2²SDF butterfly in order to reduce the computational complexity and enhance the throughput.
The digit-slicing multiplier-less R2²SDF butterfly uses the same components as the conventional structure, except for the complex multiplier, which is replaced with the digit-slicing multiplier-less unit.
Multiplication is regarded as the most important operation in most signal processing systems, but it is a complex and expensive operation. Many techniques have been introduced to reduce the size and improve the speed of multipliers. In this paper we propose a digit-slicing multiplier-less design to improve the speed of the multiplication. The digit-slicing complex multiplier was first designed in MATLAB to prove that the algorithm works; the design was then developed into the digit-slicing multiplier-less form.
The concept behind the digit-slicing architecture is that any binary number can be sliced into a few blocks of shorter binary numbers, with each block carrying a different weight [22]. In this paper, 16-bit fixed-point 2's complement arithmetic is chosen to represent the input data and the twiddle factors, which are signed numbers with absolute value less than one. Consider the complex multiplier input data (the output of Butterfly II), a value x of absolute value less than one and length 16 bits, represented in 2's complement as:
The sliced data are represented by the fundamental digit-slicing algorithm as follows:
Where x is sliced into b blocks, p is the bit width per block, and the X_{k,j} are all either one or zero, except for X_{k=b-1, j=p-1}, which is zero or minus one. The digit-slicing architecture is applied to the complex multiplier input data (the output of Butterfly II) to slice the data into four groups of four bits each, as shown in Fig. 8 and Fig. 9. The complex multiplier is realized with three real multipliers, as mentioned in the previous section; digit slicing is applied to the real multiplier input data so that the slices are multiplied by the 16-bit twiddle factor in parallel, as shown in Fig. 10. The processing time is therefore reduced. To understand and verify the digit-slicing algorithm, MATLAB designs of the complex multiplier and the digit-slicing multiplier were made and their results compared, as shown in Fig. 11 and Fig. 12.
Since the twiddle factors in the FFT are known in advance, the product of a 16-bit twiddle factor and a 4-bit input slice has only 16 possible values, which can be stored in one ROM for each twiddle factor. This turns the digit-slicing multiplier into the digit-slicing multiplier-less design, which replaces the conventional multiplier as shown in Fig. 13. The digit-slicing multiplier-less design consists of one lookup table (ROM), shifters, and an adder that form the output, as shown in Fig. 14 and Fig. 15. To generate the lookup table data (the 16 possible multiplication results), a MATLAB program was written that applies the digit-slicing algorithm to all possible 4-bit input values from "0000" to "1111". Storing all these possibilities in one ROM allows the design to perform the multiplication without any real multiplier.
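The slicing and lookup-table idea can be sketched in Python as follows. This is a minimal sketch, not the paper's Verilog or MATLAB code: it assumes slices are taken from the least significant end, that the most significant slice carries the sign, and that the sign is handled by a correction term after the lookup. The paper's actual ROM contents and sign handling may differ, and all function names are illustrative.

```python
def slice_digits(x_int, blocks=4, width=4):
    """Split a two's complement integer into `blocks` digits of `width` bits.

    Digits are returned least significant first; the top digit is signed.
    """
    mask = (1 << width) - 1
    digits = [(x_int >> (width * k)) & mask for k in range(blocks)]
    if digits[-1] >= 1 << (width - 1):         # sign bit of the top block is set
        digits[-1] -= 1 << width
    return digits

def build_lut(w_int, width=4):
    """Precompute w_int * d for every unsigned `width`-bit pattern d (done offline)."""
    return [w_int * d for d in range(1 << width)]

def lut_multiply(x_int, lut, blocks=4, width=4):
    """Multiply x_int by the twiddle factor behind `lut` using lookups, shifts, and adds."""
    mask = (1 << width) - 1
    acc = 0
    for k, d in enumerate(slice_digits(x_int, blocks, width)):
        partial = lut[d & mask]                # ROM lookup addressed by the 4-bit pattern
        if d < 0:
            partial -= lut[1] << width         # sign correction: pattern value is d + 16
        acc += partial << (width * k)          # block weight 2^(4k) applied as a shift
    return acc
```

For instance, `slice_digits(0x1234)` returns `[4, 3, 2, 1]`, and for a twiddle factor stored as the integer 23170 (approximately cos(π/4) in a 1.15 fixed-point format, an illustrative value), `lut_multiply(x, build_lut(23170))` equals `x * 23170` for every 16-bit x. The four lookups are independent of each other, which is what allows them to proceed in parallel in hardware.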
Verilog HDL code describing the functionality of the pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly was developed in the Xilinx ISE environment and downloaded to a Virtex-II FPGA board. The Virtex-II FG456 Proto board was used to implement and test the design on real hardware.
V. RESULT
Two different modules were implemented for the R2²SDF DIF FFT butterfly. The first module uses the conventional butterfly architecture, in which the twiddle factors are stored in ROM and fetched by the butterfly to be multiplied with the inputs using the dedicated high-speed multipliers of the Virtex-II FPGA.
The other module uses the pipelined digit-slicing multiplier-less architecture to perform the multiplication with the twiddle factor. Both modules were built and tested in MATLAB as described in the previous section, then coded in Verilog and synthesized using the XST (Xilinx Synthesis Technology) tool. The target FPGA was the Xilinx Virtex-II XC2V500-6-FG456. The ModelSim simulation result for the pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly is shown in Fig. 16, while the synthesis results for the two modules are presented in Table 1, which gives the hardware specifications of the design. It indicates a maximum clock frequency of 555.75 MHz for the pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly, and a maximum clock frequency of 609.980 MHz for the pipelined digit-slicing single multiplier-less unit used in the butterfly. Fig. 17 shows the RTL schematic of the pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly.
VI. CONCLUSION
This study presented an FPGA implementation of a pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly for the FFT structure. The implementation was coded in the Verilog hardware description language and tested on a Xilinx Virtex-II XC2V500-6-FG456 prototyping FPGA board. A maximum clock frequency of 555.75 MHz was obtained from the synthesis report for the pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly, which is 2.8 times faster than the conventional butterfly. It can be concluded that FPGA implementation of the pipelined digit-slicing multiplier-less radix-2² DIF SDF butterfly for the FFT structure is an enabler in solving problems that affect communication capability in FFT and has considerable potential for future related work and research.
