Abstract-Transform precoding is one of the important steps in generating the SC-FDMA baseband signal of the physical uplink shared channel in TD-LTE. By definition, transform precoding requires a mixed-radix DFT built from radix-2, radix-3, and radix-5 stages, and the transform length is not fixed. For this variable-length DFT, this paper focuses on the algorithms and designs a universal variable-point DFT module, which has been implemented on an FPGA chip. The design optimizes hardware resources, and its processing speed and power consumption basically meet the DFT requirements of uplink SC-FDMA in the TD-LTE system.
I. INTRODUCTION
The baseband signal of the TD-LTE physical uplink shared channel (PUSCH) is defined in Figure 1 [1]. Transform precoding is a DFT, that is, it converts the time-domain signal into a frequency-domain signal. Much of the literature on DFT implementation assumes a fixed transform length [2]-[5]. This paper, by contrast, implements a variable-length mixed-radix DFT for TD-LTE uplink baseband signal generation.
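As a reference point for the hardware modules discussed below, transform precoding in 3GPP TS 36.211 maps each block of $M$ modulation symbols to $z(lM + k) = \frac{1}{\sqrt{M}} \sum_{i=0}^{M-1} x(lM + i)\, e^{-j 2\pi i k / M}$. The sketch below evaluates this definition directly in O(M^2) time; it is a minimal software model, not a fast algorithm, and the function name `transform_precode` is hypothetical.

```python
import cmath
import math


def transform_precode(x, m_sc):
    """Block-wise normalized DFT, evaluated directly (O(M^2) per block).

    Splits x into blocks of m_sc samples (one SC-FDMA symbol each) and
    replaces every block by its DFT scaled by 1/sqrt(m_sc).
    """
    if len(x) % m_sc != 0:
        raise ValueError("len(x) must be a multiple of m_sc")
    scale = 1 / math.sqrt(m_sc)
    z = []
    for start in range(0, len(x), m_sc):
        block = x[start:start + m_sc]
        for k in range(m_sc):
            acc = sum(block[i] * cmath.exp(-2j * math.pi * i * k / m_sc)
                      for i in range(m_sc))
            z.append(scale * acc)
    return z


# Two 12-point blocks (one resource block each) of alternating +1/-1 symbols.
x = [complex((-1) ** n) for n in range(24)]
z = transform_precode(x, 12)
# In each block, all energy lands on subcarrier k = 6.
```

The 1/sqrt(M) scaling keeps the total energy of each block unchanged, so the precoder does not alter the signal power.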
A. Transform Precoding Points
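According to the 3GPP TS 36.211 specification, the number of DFT points used to generate the SC-FDMA symbols of the physical uplink shared channel is $M_{sc}^{PUSCH} = M_{RB}^{PUSCH} N_{sc}^{RB}$ with $N_{sc}^{RB} = 12$, where the number of allocated resource blocks shall fulfill

$$M_{RB}^{PUSCH} = 2^{\alpha_2} \cdot 3^{\alpha_3} \cdot 5^{\alpha_5} \le N_{RB}^{UL} \qquad (1)$$

and $\alpha_2$, $\alpha_3$, $\alpha_5$ are non-negative integers. This yields 34 possible DFT sizes. Building a separate DFT module for each of these 34 values would inevitably consume a large amount of hardware resources.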
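The 34 sizes can be enumerated from the constraint that each size is $12 \cdot 2^{\alpha_2} 3^{\alpha_3} 5^{\alpha_5}$ with the resource-block count not exceeding the uplink bandwidth. The sketch below assumes a 100-RB (20 MHz) uplink; the text does not state the bandwidth, but this assumption reproduces both the count of 34 and the 1200-point maximum used later.

```python
N_SC_RB = 12   # subcarriers per resource block
N_RB_UL = 100  # assumption: 100 uplink resource blocks (20 MHz)


def pusch_rb_counts(n_rb_ul):
    """All M_RB = 2^a2 * 3^a3 * 5^a5 with M_RB <= n_rb_ul, in ascending order."""
    counts = set()
    p2 = 1
    while p2 <= n_rb_ul:
        p3 = p2
        while p3 <= n_rb_ul:
            p5 = p3
            while p5 <= n_rb_ul:
                counts.add(p5)
                p5 *= 5
            p3 *= 3
        p2 *= 2
    return sorted(counts)


dft_sizes = [N_SC_RB * m for m in pusch_rb_counts(N_RB_UL)]
print(len(dft_sizes), dft_sizes[0], dft_sizes[-1])  # 34 12 1200
```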
This paper focuses on the mixed-radix DFT algorithm composed of the radix-2, radix-3, and radix-5 stages required by formula (1). It aims to minimize computation in order to save FPGA resources and improve processing speed, and finally determines an algorithm flow suitable for implementation on the FPGA.
B. Selection of the DFT Algorithm for Transform Precoding
If the input sequence length is $N = 2^n$, the radix-2 FFT algorithm can be adopted; the FPGA development software provides a corresponding FFT IP core that can be used directly.
If $N \neq 2^n$ and $N$ cannot be broken down into smaller factors (that is, $N$ is prime), only the direct DFT can compute the exact $N$-point DFT. If $N$ is composite, that is, it can be decomposed into a product of factors, a fast algorithm for composite lengths can be used. Such algorithms include the Cooley-Tukey FFT, the Good-Thomas FFT, the Winograd FFT, and the divide-and-conquer FFT [6]-[7]. Table 1 compares the performance of these four algorithms.
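The case analysis above can be sketched as a small classifier. It also splits a length into pairwise coprime prime-power factors, which is the information a composite-length algorithm works from. Both helper names are hypothetical and the code only models the decision, not any hardware.

```python
import math


def prime_power_factors(n):
    """Split n >= 2 into pairwise coprime prime-power factors (1200 -> [16, 3, 25])."""
    factors = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            q = 1
            while n % p == 0:
                n //= p
                q *= p
            factors.append(q)
        p += 1
    if n > 1:
        factors.append(n)  # remaining prime factor
    return factors


def dft_strategy(n):
    """Classify a DFT length n >= 2 following the case analysis in the text."""
    if n & (n - 1) == 0:
        return "radix-2 FFT"
    if all(n % d for d in range(2, math.isqrt(n) + 1)):
        return "direct DFT"            # prime length: no factorization possible
    return "composite-length FFT"      # Cooley-Tukey, Good-Thomas, Winograd, ...


for n in (256, 7, 27, 1200):
    print(n, dft_strategy(n), prime_power_factors(n))
```

Note that all 34 transform precoding sizes fall into the first or third case; the direct-DFT case only appears for the small prime building blocks such as 3 and 5.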
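Because the power-of-2, power-of-3, and power-of-5 factors are pairwise coprime, the Good-Thomas FFT algorithm can be chosen. It eliminates the twiddle (rotation) factors between stages, which simplifies the algorithm and saves resources. The Cooley-Tukey algorithm is applied next, within factors that are powers of a single prime (for example, 9 = 3 × 3), where Good-Thomas does not apply.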
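A minimal software sketch of the Good-Thomas (prime factor) algorithm for two coprime factors is shown below. It uses the standard Ruritanian input map and Chinese-remainder output map; these are common textbook choices and are assumed here, not taken from the paper. The sub-DFTs are evaluated directly, whereas the hardware would use the dedicated sub-modules. Requires Python 3.8+ for `pow(a, -1, m)`.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) DFT, used for the sub-transforms and as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n))
            for k in range(n)]


def good_thomas(x, n1, n2):
    """Prime-factor DFT of length n1*n2 with gcd(n1, n2) == 1.

    Input map (Ruritanian): n = (n2*i1 + n1*i2) mod N
    Output map (CRT):       k = (e1*k1 + e2*k2) mod N
    No twiddle factors are needed between the two stages.
    """
    N = n1 * n2
    if math.gcd(n1, n2) != 1 or len(x) != N:
        raise ValueError("need coprime n1, n2 and len(x) == n1*n2")
    # Gather the input into an n1 x n2 array.
    a = [[x[(n2 * i1 + n1 * i2) % N] for i2 in range(n2)] for i1 in range(n1)]
    # n2-point DFTs along the rows, then n1-point DFTs along the columns.
    a = [dft(row) for row in a]
    cols = [dft([a[i1][k2] for i1 in range(n1)]) for k2 in range(n2)]
    # e1 = 1 (mod n1), 0 (mod n2); e2 = 0 (mod n1), 1 (mod n2).
    e1 = n2 * pow(n2, -1, n1)
    e2 = n1 * pow(n1, -1, n2)
    X = [0j] * N
    for k1 in range(n1):
        for k2 in range(n2):
            X[(e1 * k1 + e2 * k2) % N] = cols[k2][k1]
    return X


x = [complex(i % 7, (3 * i) % 5) for i in range(48)]
err = max(abs(a - b) for a, b in zip(good_thomas(x, 16, 3), dft(x)))
```

Because both index maps are pure permutations, they can be precomputed and stored, which is what makes the algorithm attractive for table-driven hardware.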
To process the signal at high speed and in real time, a pipelined FFT processor can be adopted. The radix-2 FFT module includes seven sub-modules: 4-point, 8-point, 16-point, 32-point, 64-point, 128-point, and 256-point; the non-radix-2 FFT module includes ten sub-modules: 3-point, 9-point, 15-point, 27-point, 45-point, 75-point, 81-point, 135-point, 225-point, and 243-point. To compute a mixed-radix DFT of a specific size, the size is split into its radix-2 and non-radix-2 parts and the matching sub-modules are combined. Because these sub-modules are shared, the 34 DFT sizes reduce to combinations of 17 common modules. Figure 2 shows the implementation structure of the uplink transform precoding.
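The split into radix-2 and non-radix-2 parts can be checked in software. Assuming the same 100-RB uplink that yields 34 sizes up to 1200 points, the sketch below factors every size into its power-of-two part and its odd part and collects the distinct sub-module lengths; the helper names are hypothetical.

```python
def is_235_smooth(m):
    """True if m has no prime factors other than 2, 3 and 5."""
    for p in (2, 3, 5):
        while m % p == 0:
            m //= p
    return m == 1


def split_size(m):
    """Split a DFT size into (power-of-two part, odd part)."""
    p2 = m & -m  # largest power of two dividing m
    return p2, m // p2


# Assumption: 100 uplink resource blocks, 12 subcarriers each.
sizes = [12 * m for m in range(1, 101) if is_235_smooth(m)]
radix2_modules = sorted({split_size(s)[0] for s in sizes})
other_modules = sorted({split_size(s)[1] for s in sizes})
print(radix2_modules)
print(other_modules)
```

The result matches the seven radix-2 and ten non-radix-2 sub-modules listed above. For example, 1200 splits into 16 × 75, and since the two parts are coprime they can be combined with the Good-Thomas algorithm.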
For the radix-2 FFTs, the IP cores can be called directly in the computing module. Each non-radix-2 FFT is further decomposed into radix-3 and radix-5 modules. The radix-3 modules are the 3-point, 9-point, 27-point, 81-point, and 243-point DFT modules; the radix-5 modules are the 5-point and 25-point DFT modules. Larger modules repeatedly call smaller modules that have already been designed: for example, the 9-point DFT calls the 3-point module, and the 27-point DFT calls the 3-point and 9-point DFT modules.
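The module reuse described above corresponds to a radix-3 Cooley-Tukey decomposition. The sketch below assumes decimation in time, which the text does not specify: an N-point DFT (N a power of 3) is built from three N/3-point DFTs, twiddle factors, and the 3-point module, so the 9-point transform calls the 3-point module and the 27-point transform calls the 9-point one. The function names are hypothetical.

```python
import cmath
import math


def dft3(x):
    """3-point DFT module (the basic building block)."""
    w = cmath.exp(-2j * math.pi / 3)
    return [x[0] + x[1] + x[2],
            x[0] + w * x[1] + w * w * x[2],
            x[0] + w * w * x[1] + w * x[2]]


def dft_radix3(x):
    """Radix-3 decimation-in-time DFT for len(x) = 3^k, k >= 1."""
    n = len(x)
    if n == 3:
        return dft3(x)
    if n < 3 or n % 3:
        raise ValueError("length must be a power of 3")
    m = n // 3
    # Reuse the smaller module on the three decimated subsequences x[r::3].
    subs = [dft_radix3(x[r::3]) for r in range(3)]
    out = [0j] * n
    for k in range(m):
        # Twiddle factors W_n^(r*k), then combine with the 3-point module.
        t = [subs[r][k] * cmath.exp(-2j * math.pi * r * k / n) for r in range(3)]
        y = dft3(t)
        for q in range(3):
            out[k + q * m] = y[q]
    return out


y9 = dft_radix3([complex(v) for v in range(9)])
```

The radix-5 modules (5-point and 25-point) follow the same pattern with a 5-point base module.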
III. FPGA IMPLEMENTATION OF UPLINK TRANSFORM PRECODING

A. General Process of the Variable-Point DFT
In practical applications, a DSP generally serves as the main processor and the FPGA works alongside it as a co-processor. TD-LTE requires real-time processing of all data and must handle many DFT sizes and data types simultaneously, so a high-performance FPGA is needed. Here the Virtex-5 family XC5VSX95T chip [8] is used because it is a high-performance signal processing platform with advanced serial connectivity.
Data input and output pass through RAM, while ROM supplies the RAM addresses that keep data processing in the correct logical order. Figure 3 shows the general process of the variable-point DFT. Before the FPGA begins to receive data, the DSP sends it the DFT size N; the FPGA then computes the corresponding DFT.
B. Use of ROM
As Figure 3 shows, both data input and data output depend on the ROM, which is the key to processing the data of the entire system in the proper order. The ROM stores the mappings between input and output indices. Take the 1200-point DFT (16 × 75 = 1200) as an example: since 16 and 75 are relatively prime, the Good-Thomas algorithm can be adopted.
The index transform of the output sequence: 
The Block RAM embedded in the XC5VSX95T chip has a basic capacity of 36 Kb per block, with up to 244 blocks, and each block can also be used as two independent 18 Kb blocks. Block RAM can be configured as ROM: it is initialized from a memory initialization file (.coe), and its contents remain unchanged after power-up, which realizes the ROM function.
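The ROM contents for the 1200-point case can be generated offline. The sketch below computes one standard pair of Good-Thomas index tables for 16 × 75 (Ruritanian input map, CRT output map) and formats a table in the Xilinx COE syntax (`memory_initialization_radix` / `memory_initialization_vector`). The specific maps are an assumption, since the paper's own index equations are not reproduced here, and the helper names are hypothetical.

```python
def good_thomas_maps(n1, n2):
    """Input and output index tables for N = n1*n2, gcd(n1, n2) == 1.

    Entry j = i1*n2 + i2 (row-major over the n1 x n2 array) holds the
    corresponding natural-order index.
    """
    N = n1 * n2
    e1 = n2 * pow(n2, -1, n1)  # 1 (mod n1), 0 (mod n2)
    e2 = n1 * pow(n1, -1, n2)  # 0 (mod n1), 1 (mod n2)
    in_map = [(n2 * i1 + n1 * i2) % N for i1 in range(n1) for i2 in range(n2)]
    out_map = [(e1 * k1 + e2 * k2) % N for k1 in range(n1) for k2 in range(n2)]
    return in_map, out_map


def to_coe(values):
    """Format a table of integers as a decimal COE initialization file."""
    body = ",\n".join(str(v) for v in values)
    return ("memory_initialization_radix=10;\n"
            "memory_initialization_vector=\n" + body + ";\n")


in_map, out_map = good_thomas_maps(16, 75)
coe_in = to_coe(in_map)
```

Each table is a permutation of 0 to 1199, so every entry fits in 11 bits.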
C. Use of RAM
The input and output data are reordered through the RAM design, and a proper design can save a great deal of memory. The ROM output serves as the RAM write address. At each clock edge, the position sequence stored in the ROM is read out in order, each value is used as the write address of RAM(1), and the incoming data sample is stored at that location in RAM(1). The reordering is therefore completed as the input data arrive, and the appropriate radix-2 FFT can be performed after the data are read out of RAM(1).
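The write-side reordering can be modeled in a few lines. The sketch assumes the ROM holds write addresses (as stated above) and that the desired read-out order is a Good-Thomas input map; the ROM then stores the inverse of that map, so sample n is written to the slot from which it will later be read. It uses a small 15-point (3 × 5) case, and the helper name is hypothetical.

```python
def reorder_through_ram(samples, rom):
    """Write sample i to RAM address rom[i]; return RAM in sequential read order."""
    ram = [None] * len(samples)
    for i, s in enumerate(samples):  # one write per clock edge
        ram[rom[i]] = s
    return ram


n1, n2 = 3, 5
N = n1 * n2
# Desired read-out order: slot j should hold x[gather[j]].
gather = [(n2 * i1 + n1 * i2) % N for i1 in range(n1) for i2 in range(n2)]
# ROM contents: write address for the n-th incoming sample (inverse permutation).
rom = [0] * N
for j, n in enumerate(gather):
    rom[n] = j

x = list(range(100, 100 + N))
ram = reorder_through_ram(x, rom)
```

Reading RAM sequentially then delivers the permuted sequence with no extra buffering pass, which is why the reordering costs no additional latency beyond the input itself.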
IV. SIMULATION AND SYNTHESIS
The hardware implementation environment is as follows: the design language is Verilog HDL; the development platform is Xilinx ISE 10.1; the simulation platform is ModelSim SE 6.1d; the FPGA chip is the Virtex-5 family XC5VSX95T; and the simulation clock frequency is 250 MHz. For the 1200-point DFT, processing from the start of data input to the end of computation took about 73.2 µs. Figure 5 shows the output timing diagram of the 1200-point DFT.
The data output time between the two yellow lines is 4,800,000 ps (4.8 µs), and the clock frequency is 250 MHz (a 4 ns period), so the output spans 1200 clock cycles, that is, one output sample per clock cycle. The chip supports clocks of up to 550 MHz, whereas the simulation used 250 MHz; even at this lower clock, the largest (1200-point) DFT takes about 73.2 µs, which basically meets the requirements, so the chip can fully support the hardware implementation.
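The timing figures above can be cross-checked with a little arithmetic. The first three lines restate the reported numbers; the last line is only a hypothetical projection that assumes the cycle count stays at 18,300 and that the design would close timing at 550 MHz, neither of which the reported results establish.

```latex
\begin{aligned}
T_{clk} &= 1 / (250\ \text{MHz}) = 4\ \text{ns} = 4000\ \text{ps} \\
4\,800\,000\ \text{ps} / 4000\ \text{ps} &= 1200\ \text{cycles (one output sample per cycle)} \\
73.2\ \mu\text{s} \times 250\ \text{MHz} &= 18\,300\ \text{cycles} \\
18\,300\ \text{cycles} / 550\ \text{MHz} &\approx 33.3\ \mu\text{s}\ \text{(hypothetical)}
\end{aligned}
```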
The complete system contains the common modules covering all 34 variable-length DFT sizes, and these modules were compiled and synthesized successfully. According to the synthesis report, the design uses 72% of the LUTs, 36% of the flip-flops (FFs), 17% of the Block RAM, and 46% of the multipliers; these figures basically meet the requirements.
V. CONCLUSION
In contrast to the traditional fixed-length DFT approach, this paper addresses the complete process of a variable-point DFT. For TD-LTE uplink transform precoding, whose mixed-radix DFT has a variable length, the shared functionality is extracted into regular, modular, and easily programmed modules. As a result, considerable hardware resources are saved, and the speed basically meets the requirements.
