Abstract: This paper presents an efficient architecture for performing fast Fourier transforms (FFTs) of 128 points to 1M points based on a mixed radix-2/4/8 butterfly unit. The proposed FFT architecture reduces the computation cost by taking advantage of the radix-8 FFT algorithm while remaining compatible with sequences whose length is an integral power of 2. Further optimizations for a reconfigurable application-specific processor are developed. First, we propose a separated radix-2/4/8 butterfly unit, which is more flexible than a complete radix-2/4/8 butterfly unit. Second, for sequences longer than 256K points, an efficient 2-epoch FFT solution is realized. The FFT architecture is implemented in a reconfigurable application-specific processor. The computation time of our architecture is 676 µs and 14.8 ms for 128K-point and 1M-point FFTs, respectively.
Introduction
Pulse compression is a signal processing technique commonly used in radar, sonar and echography to increase the range resolution as well as the signal-to-noise ratio. As the demand for ultra-wideband radar signal processing increases, pulse compression of ultra-long series has become an active research topic. Since the fast Fourier transform (FFT) and the inverse fast Fourier transform (IFFT) are the primary calculations in pulse compression, modern real-time radar systems place high performance demands on the FFT processing of ultra-long series [1, 2].
Reconfigurable computing is rapidly emerging as a third computing paradigm, alongside hard-coded designs, often in the form of application-specific integrated circuits (ASICs), and programmable systems, such as classic processors, digital signal processors (DSPs) and graphics processing units (GPUs). In hard-coded designs, the hardware datapaths and the computing algorithms are both fixed at production. In programmable systems, the hardware datapaths are fixed before production but implement generic primitives with a regular interconnection; algorithms are implemented after production by freely scheduling operations on the datapaths through software. In reconfigurable computing, the hardware is customized for a family of algorithms in a given application area and is dynamically configured when the device is in use. Digital signal processing must handle a high density of data and many different algorithms built on multiplication and addition, so it calls for both greater computational capability and broader support for the different algorithms used in application-specific digital signal processing.
Common VLSI implementations of FFT architectures can be classified into three categories: memory-based architectures [3, 4, 5, 6], cache-based architectures [7, 8, 9] and pipelined architectures [10, 11, 12, 13]. Memory-based architectures generally consist of processing units and memory blocks. Cache-based architectures adopt a data cache to reduce memory accesses. Pipelined architectures offer a high data throughput rate and low controller complexity. However, for ultra-long series, pipelined architectures incur a high area cost, and the extra off-chip accessing cost becomes the determining factor for both memory-based and cache-based architectures. For series that are not very long, some works adopt a memory large enough to store all samples, which avoids off-chip accessing during computation. Since the memory capacity is limited by area cost, operating frequency and power consumption, off-chip accessing is inevitable in the ultra-long case. Our design adopts a 2-epoch algorithm similar to that in Guan's and Baas's works, which split a long FFT into two loops of smaller FFTs [9, 14]. However, the data length supported by the proposed architecture is much longer, and we use off-chip accessing instead of storing all data on chip. Moreover, a balanced architecture is proposed based on further consideration of system resources and off-chip accessing bandwidth.
Compared with Guan's application-specific instruction set processor, our work introduces coarser-grained instructions and employs a 2-epoch methodology with off-chip accessing for ultra-long FFTs. Chen's work can perform FFTs of all integral-power-of-2 lengths, but it requires additional memory for series longer than 32K points. Although our work incurs a higher area cost because of the 32-bit single-precision data format and the large on-chip memory needed to cope with ultra-long series, the comparison with other works for 1K-point and 32K-point series shows that our work achieves a performance improvement. Moreover, by balancing computation and off-chip accessing, our design maintains good performance for longer series.
Although much effort has been devoted to FFT processing in both software and hardware, providing both flexibility and high throughput remains challenging. In this work, we focus on performing FFTs of ultra-long series and on system balancing. Our contribution lies in three aspects. First, we propose an efficient memory-based FFT architecture adopted in a reconfigurable application-specific processor (RASP) for radar digital signal processing. The separated radix-2/4/8 butterfly unit, which is more flexible than a complete radix-2/4/8 butterfly unit, reduces the number of computation stages by taking advantage of the radix-8 FFT algorithm while remaining compatible with sequences whose length is an integral power of 2. The twiddle factor generation unit makes a good tradeoff between memory and logic cost and balances the utilization of adders and multipliers. Second, for sequences longer than 256K points, an efficient 2-epoch FFT solution is realized. Taking both computing performance and off-chip accessing performance into account, the 2-epoch solution brings their time consumption close together and uses a "tick-tock" methodology to overlap the accessing time with the computation. Third, this work is coded in a hardware description language (HDL) and implemented by following the standard cell-based IC design flow, and the heterogeneous SoC containing the RASP has been fabricated in a 40 nm complementary metal-oxide-semiconductor (CMOS) process.
The proposed FFT architecture is applied in a taped-out RASP for radar applications. Our RASP costs about 2.6 million gates, or a core area of 3.7 mm² without SRAM, in TSMC 40 nm CMOS technology. The computation time of our architecture is 676 µs and 14.8 ms for 128K-point and 1M-point FFTs, respectively. The rest of the paper is organized as follows. Section 2 introduces the FFT algorithm and the RASP architecture. Section 3 describes the optimizations in the proposed architecture. Section 4 presents our FFT design, and Section 5 shows the experimental results. Finally, we draw conclusions in Section 6.
In general, a higher-radix algorithm is preferred because it reduces both the number of multiplications in the butterfly computations and the number of computation stages.
The mixed-radix (MR) FFT refers to performing an FFT with butterfly calculations of different radices. For the radix-2/4/8 MR algorithm, the case of an input series whose length is a power of 2 but not a power of 8 arises naturally. In this case, a radix-2 or radix-4 butterfly calculation is performed at the beginning of the transform, while the rest of the transform is calculated by the radix-8 algorithm. Compared to radix-2 or radix-4, the radix-2/4/8 MR algorithm reduces the computation cycles while remaining compatible with series whose length is an integral power of 2.
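As a minimal sketch of how such a stage sequence could be chosen, the following Python function factors a power-of-2 length into one optional radix-2 or radix-4 stage followed by radix-8 stages. The function name `plan_stages` and the decision to place the low-radix stage first are illustrative assumptions drawn from the description above, not the controller logic of the actual design.

```python
def plan_stages(n):
    """Return the radix of each butterfly stage for an n-point FFT.

    Assumption: n is a power of 2; one radix-2 or radix-4 stage (if needed)
    comes first and every remaining stage is radix-8.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of 2 and at least 2")
    exponent = n.bit_length() - 1          # n == 2 ** exponent
    radix8_stages, leftover = divmod(exponent, 3)
    stages = []
    if leftover:                           # leftover 1 -> radix-2, 2 -> radix-4
        stages.append(2 ** leftover)
    stages.extend([8] * radix8_stages)
    return stages


for points in (128, 1024, 128 * 1024, 1024 * 1024):
    print(points, plan_stages(points))
```

Under these assumptions a 128K-point transform needs six stages (one radix-4 and five radix-8), whereas a pure radix-2 decomposition of the same length needs 17.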
2-epoch FFT algorithm for ultra-long series
When performing a long FFT, handling the data interleaving among different stages is an important issue. Using on-chip memory to store all of the data is an effective way to cope with the interleaving. For ultra-long sequences, however, storing all data on chip becomes inefficient and hard to implement. Baas proposed a 2-epoch FFT algorithm that breaks a long transform down into two epochs, each of which processes a batch of shorter FFTs [14]. There is no data interleaving among the short transforms within a batch, and the data interleaving between the two epochs is easy to handle. Taking advantage of this, the butterfly processing and the memory accessing of different short FFTs can be performed concurrently.
The N-point 2-epoch FFT algorithm is based on the following decomposition, where N = L × C and the N-point sequence is arranged as a matrix with L rows and C columns.
From Eq. (1), the decomposition can easily be derived: Eq. (5) is an L-point FFT of a column of the matrix in Eq. (3), and Eq. (6) is a C-point FFT of a row of the matrix in Eq. (3). The N-point ultra-long sequence FFT is thus decomposed into C L-point FFTs, L C-point FFTs and multiplications by the middle twiddle factors. The procedure of the 2-epoch FFT is as follows: 1) arrange the N-point sequence as a matrix with L rows and C columns; 2) perform an L-point FFT on each of the C column sequences of the matrix; 3) multiply the results by the middle twiddle factors; 4) perform a C-point FFT on each of the L row sequences of the matrix; 5) transpose and output the results.
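The five steps can be checked numerically with a short Python sketch. It assumes a row-major layout in which matrix element (l, c) holds x[C·l + c] and an output index k = k1 + L·k2; these index conventions, the function names and the use of a direct DFT for the short transforms are illustrative assumptions, since the original equations are not reproduced here.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT, used as the short transform and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def two_epoch_fft(x, rows, cols):
    """N-point DFT computed as column transforms followed by row transforms."""
    n = rows * cols
    if len(x) != n:
        raise ValueError("len(x) must equal rows * cols")
    # 1) Arrange the sequence as a rows x cols matrix (row-major layout).
    m = [[x[cols * l + c] for c in range(cols)] for l in range(rows)]
    # 2) Epoch 1: an L-point transform of every column.
    for c in range(cols):
        col = dft([m[l][c] for l in range(rows)])
        for k1 in range(rows):
            m[k1][c] = col[k1]
    # 3) Multiply element (k1, c) by the middle twiddle factor W_N^(k1*c).
    for k1 in range(rows):
        for c in range(cols):
            m[k1][c] *= cmath.exp(-2j * cmath.pi * k1 * c / n)
    # 4) Epoch 2: a C-point transform of every row.
    m = [dft(row) for row in m]
    # 5) Transpose: X[k1 + rows*k2] is held at m[k1][k2].
    return [m[k1][k2] for k2 in range(cols) for k1 in range(rows)]


x = [complex(i % 7, -(i % 3)) for i in range(32)]
error = max(abs(a - b) for a, b in zip(two_epoch_fft(x, 4, 8), dft(x)))
print("max deviation from direct DFT:", error)
```

Because each column transform in epoch 1 and each row transform in epoch 2 touches only its own data, the short transforms within an epoch are independent, which is what allows them to be processed batch by batch.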
A reconfigurable processor for radar digital signal processing
From the perspective of reconfigurable computing hardware, we realize an RASP for radar applications. Since a large number of applications based on multiplications and additions are widely used in radar digital signal processing, our RASP integrates 17 applications for vector and matrix computation. This section presents the architecture of the RASP. FFT is one of the key applications integrated in the RASP. The FFT instruction operands include the source data address, source data length, destination data address, destination data length, twiddle factor data address, twiddle factor data length, series length, input/output order, inverse FFT flag and the number of off-chip accesses for the 2-epoch FFT. When instructions are received from the high-performance bus, the RASP selects the reconfigurable controller for FFT and dynamically configures the data paths between the memory, the calculation logic and the reconfigurable controller to achieve the desired function.
The RASP works as a high-performance application-specific processor in a heterogeneous system on a chip (SoC). In the SoC, two DSPs, an RASP and a memory controller are connected by a 256-bit high-performance bus based on the Advanced Microcontroller Bus Architecture (AMBA) 3.0. The RASP employs two kinds of bus interfaces. As a master device, the RASP can access the main memory through the direct memory access (DMA) engine, and as a slave device, the RASP can be configured, suspended and interrupted by the DSPs. The system employs DDR3 memory, whose maximum operating frequency is 1.067 GHz, as the main memory. Fig. 1 shows the overall architecture of the proposed RASP. This architecture consists of a main controller (MC), a reconfigurable controller (RC), a reconfigurable computing array (RCA), a DMA unit, bus interfaces and memory.
The MC manages the overall control of the RASP, which includes instruction decoding, DMA configuration and RC configuration. The RC manages the computation procedures and connects the data paths between the memory and the RCA. The RCA has six reconfigurable processing elements (RPEs), each of which consists of computation logic and a reconfigurable data path. RPE1 to RPE4 are isomorphic processing elements, each of which consists of one complex multiplier, one real multiplier and four complex adders. RPE5 is used for the conversion between fixed-point and floating-point data types, and RPE6, which includes multipliers, adders and a divider, is mainly used for matrix inversion. Employing a large on-chip memory is a key feature of the RASP. The proposed RASP uses 2 MB of single-port static random access memory (SRAM) as on-chip memory. The memory has 32 banks, each of which consists of eight SRAM elements that are 64 bits wide and 1024 words deep.
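The stated memory organization can be cross-checked with a few lines of arithmetic. The sketch below assumes that one sample is a complex number stored as two 32-bit floats (8 bytes); the text gives the 32-bit single-precision format, but the 8-byte complex layout is an assumption, and the constant names are chosen here for illustration.

```python
BANKS = 32
ELEMENTS_PER_BANK = 8
WORD_BITS = 64
WORDS_PER_ELEMENT = 1024

total_bytes = BANKS * ELEMENTS_PER_BANK * WORDS_PER_ELEMENT * WORD_BITS // 8

# Assumption: one complex sample = two 32-bit floats = 8 bytes.
BYTES_PER_SAMPLE = 8
samples_on_chip = total_bytes // BYTES_PER_SAMPLE

print(total_bytes // 2**20, "MiB on chip")
print(samples_on_chip // 1024, "K complex samples")
```

Under this assumption the on-chip memory holds exactly 256K complex samples, which is consistent with 256K points being the longest sequence processed without off-chip accessing during computation.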
Communication within the RASP comprises control flow and data flow. Through the control path shown in Fig. 1, the MC receives instructions from the bus interface and sends the reconfiguration information to the RC and the off-chip accessing information to the DMA engine. After the reconfiguration information is received, the RC selects one algorithm control module and the connections among the RPEs. The data paths in Fig. 1 are responsible for the data flow in the RASP. The DMA is used to read and write data between the off-chip memory and the on-chip memory. Through the memory switch, different banks provide data to the appropriate computation logic.
To handle the data interleaving of the 2-epoch FFT algorithm, a "tick-tock" processing strategy is used in our architecture. Instead of storing all data on chip, the proposed architecture performs the butterfly computation while simultaneously accessing the data to be used in the next stage. As Fig. 1 shows, the green dashed frame is responsible for butterfly processing and the red dashed frame is used for off-chip accessing. The on-chip memory is enclosed in the blue dashed frame. The on-chip memory is divided into two parts, which are used by butterfly processing and off-chip accessing alternately. Moreover, because the DMA engine has a throughput similar to that of the butterfly units, the off-chip accessing and the butterfly processing overlap effectively in the proposed architecture.
Flexible radix-2/4/8 mixed butterfly unit
While a higher-radix algorithm reduces the number of computing stages, it also requires more adders and multipliers than radix-2 or radix-4 butterfly units [4, 15]. The increased use of floating-point units (FPUs) in a complete higher-radix butterfly unit (BU) significantly limits the practicability of higher-radix algorithms in VLSI design. Moreover, since the sequence length must be an integral power of the radix, traditional higher-radix algorithms are not flexible enough. To address these limitations, we propose a flexible pipelined radix-2/4/8 BU. As shown in Fig. 2, it takes a quarter of the resources of a complete radix-8 BU and outputs the radix-8 computation results, 8 samples, in 4 clock cycles. By choosing different datapaths, the BU can perform radix-2, radix-4 or radix-8 butterfly calculations.
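The arithmetic relationship between the three radices can be illustrated numerically. A radix-8 butterfly is an 8-point DFT, which factors into three layers of 2-input butterflies; a radix-4 butterfly uses two such layers and a radix-2 butterfly uses one. The Python sketch below models only this arithmetic. The function name `butterfly` and its recursive structure are assumptions for illustration and do not represent the register, multiplexer and FPU arrangement of Fig. 2 or its 4-cycle timing.

```python
import cmath


def butterfly(x):
    """Radix-2, -4 or -8 butterfly built from layers of 2-input butterflies.

    A length-1 input is accepted only as the base case of the recursion.
    """
    n = len(x)
    if n not in (1, 2, 4, 8):
        raise ValueError("only radix-2, radix-4 and radix-8 are modelled")
    if n == 1:
        return list(x)
    even = butterfly(x[0::2])              # half-size butterfly on even inputs
    odd = butterfly(x[1::2])               # half-size butterfly on odd inputs
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t               # one 2-input butterfly per k
        out[k + n // 2] = even[k] - t
    return out


print(butterfly([1, 0, 0, 0, 0, 0, 0, 0]))   # impulse input: all outputs equal 1
```

The same routine serves all three radices simply by changing the input length, which mirrors, at the level of arithmetic only, the idea of selecting a radix by selecting a datapath.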
The BU consists of registers, multiplexers and FPUs. The registers are placed between the pipeline stages of the radix-8 algorithm, and the multiplexers control the data paths between the registers and the FPUs.
The proposed BU uses an in-place accessing method, which means that the outputs of the BU are stored in the same memory locations as the inputs. Using dual-port SRAMs is a straightforward way to avoid bank conflicts. However, the power consumption and area cost of dual-port SRAM blocks are too large to be acceptable. In this work we use a two-part memory. For each part of the memory, 16 single-port SRAM banks with a conflict-free addressing algorithm are used to meet the throughput requirement of the BUs.
A twiddle factor generation unit in a VLSI FFT architecture is usually implemented with look-up tables or real-time calculation units [16, 17]. By taking advantage of the circular symmetry of the twiddle factors, only one eighth of the coefficients for the longest supported sequence length need to be stored, but this still incurs a considerable memory cost for ultra-long series.
Generating twiddle factors by trigonometric calculation requires high-performance pipelined trigonometric units, which are barely used by the applications of the RASP other than FFT. Using the products of interpolations and an increment is another way to calculate twiddle factors in real time, but the iterated multiplication may reduce the precision. To handle this problem, we adopt a twiddle factor generation unit that uses the products of interpolations.
In Eq. (7), k = 1, 2, ..., 1M; m = 1, 2, ..., 1024; n = 1, 2, ..., 1024. As Eq. (7) illustrates, the twiddle factors for a 1M-point FFT can be calculated by multiplying two groups of interpolations. Compared with storing all coefficients on chip, this saves nearly 2 MB of storage at the cost of a multiplier. Compared with twiddle factor generation by iterated multiplication, this method achieves higher precision and lower storage cost.
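The idea of forming each twiddle factor as one product of two table entries can be sketched in Python as follows. The split of the exponent as k = 1024·m + n with zero-based m and n, the table names `coarse` and `fine`, and the use of double precision are assumptions made for illustration; the index ranges in the text are one-based and the design itself uses single precision.

```python
import cmath

N = 1 << 20          # 1M-point FFT
STEP = 1 << 10       # 1024 entries per table (assumed split of the exponent)

# Coarse table holds W_N^(1024*m), fine table holds W_N^n, for m, n = 0..1023.
coarse = [cmath.exp(-2j * cmath.pi * (STEP * m) / N) for m in range(STEP)]
fine = [cmath.exp(-2j * cmath.pi * n / N) for n in range(STEP)]


def twiddle(k):
    """Return W_N^k as the product of one coarse and one fine table entry."""
    m, n = divmod(k % N, STEP)
    return coarse[m] * fine[n]


# Compare against direct evaluation on a sparse sample of exponents.
worst = max(abs(twiddle(k) - cmath.exp(-2j * cmath.pi * k / N))
            for k in range(0, N, 4099))
print(len(coarse) + len(fine), "stored entries, max deviation", worst)
```

Every factor is obtained with a single complex multiplication, so rounding error does not accumulate from one factor to the next as it would when each factor is derived from the previous one by repeated multiplication.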
The 2-epoch calculation for ultra-long sequence
The storage resources, one of the most important aspects of FFT architectures, include on-chip memory and off-chip memory. On-chip storage is generally implemented with high-performance devices such as SRAMs. While SRAM devices satisfy the performance requirement, they incur a higher area cost and power consumption. Off-chip devices offer high storage density but much lower performance because of long access latency and access conflicts. To satisfy the application requirements, the proposed architecture employs 2 MB of SRAM to guarantee the performance for sequences of 256K points or fewer. For longer sequences, the architecture uses tick-tock memory management based on the 2-epoch FFT algorithm, which assigns the memory alternately to the computation stage and the off-chip accessing stage.
As Fig. 3 shows, a 512K-point FFT is broken down into 4 column computing stages and 4 row computing stages based on the 2-epoch FFT algorithm, and the storage is split into two parts. Most of the time, one part is occupied by the FPUs and the other by off-chip accessing.
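The alternation between the two memory parts can be written out as a simple schedule. In the Python sketch below, each of the four computing stages of one epoch is treated as a batch, the two memory parts are labelled 'A' and 'B', and the DMA both writes back finished results and fetches the next batch. The function name and the exact ordering of loads and stores are assumptions for illustration; the text does not specify the controller's sequencing at this level.

```python
def tick_tock_schedule(batches):
    """List (step, compute, dma) entries for a two-part ping-pong memory.

    compute is (part, batch) or None; dma is a list of (action, part, batch).
    """
    parts = "AB"
    schedule = []
    for step in range(batches + 2):
        compute = None
        if 1 <= step <= batches:
            batch = step - 1
            compute = (parts[batch % 2], batch)
        dma = []
        if step >= 2:                       # write back results computed earlier
            dma.append(("store", parts[(step - 2) % 2], step - 2))
        if step < batches:                  # fetch the batch to be computed next
            dma.append(("load", parts[step % 2], step))
        schedule.append((step, compute, dma))
    return schedule


for step, compute, dma in tick_tock_schedule(4):
    print(step, "compute:", compute, "dma:", dma)
```

In every step where computation is active, the DMA works on the other memory part, so only the first load and the last store are not hidden behind computation in this model.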
The 2-epoch method is an efficient way to solve the interdependency problem in FFT, but it also leads to extra off-chip accessing for ultra-long series. The off-chip accessing speed, which is influenced by DDR efficiency, is a key factor in the performance. For 2-epoch processing, the architecture is carefully designed so that the computation capability and the off-chip accessing capability are close to each other. To measure the off-chip accessing bandwidth, a simulation of the interactions between the DMA and the DDR was carried out. It shows that the accessing bandwidth reaches 9.7 GB/s for continuous data and 5.2 GB/s for rectangular data accessing. However, it may be lower in a real SoC because of DDR contention from other devices. The throughput of the BUs when calculating 128K data points for three stages is 2.67 GB/s. Without off-chip accessing hazards, this architecture takes 392 µs to finish the computation of one stage and 309.7 µs to finish the off-chip accessing.
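One reading of these figures that comes out close to both quoted times is sketched below in Python. It assumes that a block is 128K complex samples of 8 bytes each, that GB means 10^9 bytes, and that the off-chip accessing for a block consists of one transfer at the continuous rate plus one at the rectangular rate. All three assumptions are interpretations rather than statements from the text.

```python
POINTS = 128 * 1024
BYTES_PER_POINT = 8                 # assumption: complex sample, 2 x 32-bit float
GB = 1e9                            # assumption: decimal gigabytes

block_bytes = POINTS * BYTES_PER_POINT

# Time for the butterfly units to process one block at 2.67 GB/s.
compute_us = block_bytes / (2.67 * GB) * 1e6

# Assumption: one transfer at the continuous rate plus one at the rectangular rate.
access_us = (block_bytes / (9.7 * GB) + block_bytes / (5.2 * GB)) * 1e6

print(f"compute {compute_us:.1f} us, off-chip access {access_us:.1f} us")
```

Under these assumptions the computation of a block takes longer than its off-chip accessing, so the accessing can be hidden behind computation as long as the DDR delivers close to the simulated bandwidth.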
Experimental results
The proposed FFT design has been coded in HDL and implemented in a reconfigurable application-specific processor by following the standard cell-based IC design flow. It has been verified through C behavioral simulation in a UVM environment, SpyGlass HDL coding rule checks, Verilog RTL simulation, logic synthesis, Verilog gate-level simulation, placement and routing, DRC, LVS and post-layout simulation. The proposed design has been synthesized using a 40 nm CMOS technology.
The performance for 128 to 1M points is evaluated. Fig. 4 presents the computation, off-chip accessing and configuration cycles, as well as the cycles saved by tick-tock memory accessing in the 2-epoch algorithm. Before starting an application, the RASP takes about four thousand cycles to receive all the instructions from the bus and decode them. Since the configuration overhead is influenced by bus efficiency, there is some uncertainty in the configuration cycles. For series longer than 256K points, the overlap of accessing and computation is also presented.
With an ideal off-chip memory device, the computation takes more cycles than the off-chip accessing. However, the two may be more balanced in the SoC, because the off-chip accessing performance is affected by DDR hazards.
The proposed RASP processes FFTs from 128 points to 256K points without off-chip accessing during the computing phase, and uses the tick-tock off-chip access strategy for sequences from 512K points to 1M points. It completes a 128K-point FFT in 676.5 µs and a 1M-point FFT in 14.8 ms. Table I presents the hardware comparison of this work with two other FFT realizations: Chen's low-memory-access, length-adaptive FFT architecture for all integral powers of 2 [5], and Lin's FFT processor, a design for long lengths that supports 32K-point series [6].
Because the proposed architecture is designed for ultra-long series and single-precision floating-point data, and therefore employs floating-point computing logic and memory, the area cost of our design is larger than that of the others. Based on an effective performance measure proposed in [14], we introduce the word length as a factor to normalize the area efficiency, as in Eq. (8).
Nevertheless, to make the comparison more reasonable, we also compare the computation logic cost in terms of the number of multipliers. Our work uses two complex multipliers (CMs) and two real multipliers (RMs) in the BU. For N-points trans- 
Conclusions
In this work, we present an FFT architecture for ultra-long series implemented in a reconfigurable application-specific processor. This architecture has been fabricated in a 40 nm CMOS process. A radix-2/4/8 BU and a tick-tock memory accessing method are applied to support integral-power-of-2 sequence lengths from 128 to 1M. The twiddle factor generation unit used in this design saves a large amount of storage at the cost of a multiplier while keeping high precision. The overall performance is greatly improved by parallel computation and the tick-tock memory accessing arrangement. Furthermore, by using the 2-epoch algorithm and balancing computation against off-chip access, the FFT architecture supports ultra-long series well.
