Abstract-This paper presents the implementation of a novel parallel FFT algorithm on SmartCell, a coarse-grained reconfigurable architecture targeted at data streaming applications. The proposed FFT algorithm balances workload and memory requirements among the computational units while maintaining an optimized data flow at low configuration and communication cost. The algorithm is mapped onto a SmartCell prototype device with 64 processing elements. Results show that the parallel FFT implementation on SmartCell is about 14.9 and 2.7 times faster than network-on-chip (NoC) and Morphosys implementations, respectively. It is also about 3.6 times more energy efficient than pipelined FFT implementations on FPGA.
I. INTRODUCTION
The real-time constraints of data streaming applications, especially on portable devices, impose stringent energy and performance requirements. General purpose solutions, such as programmable general purpose processors (GPPs), are widely used in conventional datapath oriented applications due to their flexibility and ease of use. However, they cannot meet the increasing performance, cost and energy requirements of data streaming applications because of their sequential software execution. On the other hand, although application specific integrated circuits (ASICs) can provide the best performance for specific applications, they generally have a fixed data flow with predefined functionality, which prevents them from accommodating new system requirements or protocol changes. Reconfigurable architectures (RAs) have been proposed as a way to balance the flexibility of GPPs and the performance of ASICs. SRAM-based FPGAs, which decompose complex logic functions into smaller ones and map them onto bit-level lookup tables (LUTs) or other on-chip embedded resources, are still the dominant technology in RAs. However, the flexibility of bit-level fine-grained FPGAs comes at a cost in area, power consumption, speed and configuration time, due to their large routing area overhead and timing penalty.
In recognition of these issues, several research projects have integrated more coarse-grained computing and communication components onto the same chip, as summarized in [1]. In this paper, SmartCell is presented as a novel coarse-grained reconfigurable architecture (CGRA) targeted at domain specific applications with inherent data parallelism and high computing and communication regularity. By integrating a large number of computing components and programmable switching fabrics onto the same chip, SmartCell has the potential to achieve the performance of fixed function ASICs while offering flexibility similar to that of processors. Currently, a second generation SmartCell has been developed with dedicated on-chip data memories to store input data or intermediate results.
An overview of SmartCell is depicted in Fig. 1. In a typical SmartCell system, a set of cell units is organized in a 2D mesh structure. Each cell consists of four processing elements (PEs) placed in the east, west, south and north directions to form a quad structure. Each PE is composed of an arithmetic unit, a logic operator, I/O muxes and an instruction controller. In our current design, a 1K distributed data memory is attached to each PE so that large applications can be mapped directly onto SmartCell. The PE structure and local data memory connections are shown in Fig. 2. Each PE has full control of writing and reading of its own memory. The data read from a memory can be shared by two PEs to provide more flexibility. A distributed instruction memory in each PE provides the data path and functionality configurations. At runtime, an instruction code is loaded into the instruction controller to build the dataflow for a specific algorithm. The instruction code can be changed on a cycle by cycle basis.
In the SmartCell system, a three-level hierarchical routing structure is designed for on-chip data communication. First, the four PEs in the same cell are connected by a fully connected non-blocking crossbar unit. Second, adjacent cells are directly connected through short wires. In this way, each of the four PEs placed at the edges of a cell is directly linked with the nearest PE in the adjacent cell to form a static nearest neighbor connection, which provides bi-directional data transmission with no extra delay or synchronization requirements. Finally, a hierarchical concentrated mesh (CMesh) network, depicted by the blue lines in Fig. 1, exchanges data between non-adjacent cells. According to [2], the CMesh has the potential to provide the best performance in terms of average latency and network efficiency among networks-on-chip (NoCs) including Mesh, Torus, and FTree. Experiments show that the CMesh network can achieve a single hop communication throughput of 57.6 Gbits/s for a 4 by 4 SmartCell operating at 100 MHz. More design details of SmartCell can be found in [3].
II. PROPOSAL AND MAPPING OF A PARALLEL FFT ALGORITHM
The Discrete Fourier Transform (DFT) is one of the most important digital signal processing algorithms in many communication and multimedia applications. An N-point DFT is defined in Eq. 1.

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \tag{1}$$

where x(n) and X(k) are the complex input and output, and the twiddle factor is $W_N = e^{-2\pi i/N}$. The Fast Fourier Transform (FFT) is a fast DFT algorithm that reduces the computing complexity from $O(N^2)$ to $O(N \log_2 N)$. However, the intensive computations are still the bottleneck for large-size FFT/IFFT designs. Recently, several approaches have been proposed to compute the FFT in parallel on multiprocessor architectures [4]-[8]. In this section, we propose a novel parallel FFT algorithm and its mapping onto the SmartCell architecture.

Algorithm 1 Local sequential execution in each BU (fragment)
    ...
    for j ← 0 to N/2P − 1 do
        get w1, w2 based on i, j
        ...
    end for
    end for
The proposed parallel FFT algorithm consists of two steps: local sequential execution and cross parallel execution. During the local sequential execution, the first $\log_2(N/2P)$ stages are computed sequentially in each butterfly unit (BU), which operates on its N/P locally stored data points, where P is the total number of BUs. Algorithm 1 shows the operations involved in the local sequential execution for each BU. Two values are read from two consecutive memory locations and, after processing, are written back to two locations that are N/2P apart. No cross communication is needed during this sequential step.
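The split between the two steps can be checked with a short sketch. The helper names `stage_split` and `local_stage_addresses` are hypothetical, and the address pattern is only one reading of the description above (read locations 2j and 2j+1, write locations j and j + N/2P); the twiddle factors used in Algorithm 1 are not modeled. The value P = 32 in the usage lines is an assumption that all 64 PEs of the prototype are used with two PEs per BU.

```python
def _log2_exact(value):
    """Return log2(value) for a positive power of two, else raise ValueError."""
    if value < 1 or value & (value - 1):
        raise ValueError(f"{value} is not a power of two")
    return value.bit_length() - 1


def stage_split(n_points, n_bus):
    """Return (local_stages, cross_stages) for an N-point FFT on P butterfly units.

    local_stages = log2(N/2P) stages run inside each BU,
    cross_stages = log2(2P) stages exchange data between BUs.
    """
    total = _log2_exact(n_points)
    cross = _log2_exact(2 * n_bus)
    if cross > total:
        raise ValueError("need N >= 2P")
    return total - cross, cross


def local_stage_addresses(n_points, n_bus):
    """Assumed local memory accesses of one BU during one sequential stage.

    Each tuple is (read_0, read_1, write_0, write_1): two consecutive reads
    and two writes that are N/2P locations apart.
    """
    half = n_points // (2 * n_bus)
    return [(2 * j, 2 * j + 1, j, j + half) for j in range(half)]


# N = 2P: no sequential stage is needed, all stages are cross parallel.
print(stage_split(8, 4))
# 1024-point FFT on an assumed 32 BUs: 4 local stages plus 6 cross stages.
print(stage_split(1024, 32))
print(local_stage_addresses(16, 2))
```

In both cases the two counts add up to $\log_2 N$, so every FFT stage is covered by exactly one of the two steps.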
After the sequential step, a cross parallel execution is performed for the remaining $\log_2(2P)$ FFT stages, as shown in Algorithm 2. During this step, the intermediate butterfly results need to be transferred among different BUs. We use BU_ID to denote the index of the current BU, ranging from 0 to P − 1. Two cases are distinguished based on the number of FFT points N and the number of butterfly units P. When N is equal to 2P, only two points are stored in each BU and no sequential step is necessary. When N is greater than 2P, more than two points are stored and processed by each BU, which requires both the sequential and the parallel step. In general, during the cross parallel execution each BU calculates the butterfly results from its local data and then forwards them to two BUs determined by its own BU_ID.
Algorithm 2 Cross parallel execution (fragment)
    ...
    for j ← 0 to N/2P − 1 do
        if (N > 2P) then
            if (i == 0) then
                ...

The proposed FFT algorithm achieves a fixed data transfer pattern across all FFT stages. Fig. 3 gives a data flow example of an 8-point FFT with 4 butterfly units. According to the algorithm, BU0 calculates the butterfly results from its local data and sends them to memory 0 of BU0 and BU2. The same data flow is repeated during stages 2 and 3. The traditional FFT data transfer pattern is drawn in Fig. 4(a); it involves different data flows at different FFT stages. In contrast, the proposed parallel FFT algorithm uses the same data transfer pattern for all transform stages, as shown in Fig. 4(b). The optimized transfer pattern reduces both communication and configuration overhead, especially for large FFTs.
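One well-known FFT formulation with a stage-independent transfer pattern is the constant-geometry (Pease-style) decimation-in-time FFT. The sketch below uses that textbook scheme as a stand-in and is not a reconstruction of Algorithm 2. It covers only the N = 2P case, assuming BU t holds global positions 2t and 2t+1 and writes its results to positions t and t + N/2 in every stage. The function names are hypothetical, and the bit-reversed input order is a property of this formulation that may differ from the ordering used in Fig. 3.

```python
import cmath


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def destinations(bu_id, n_bus):
    """(BU, memory slot) pairs that receive the two outputs of one BU when N = 2P."""
    upper = bu_id           # global position of a + w*b
    lower = bu_id + n_bus   # global position of a - w*b (N/2 equals P here)
    return (upper // 2, upper % 2), (lower // 2, lower % 2)


def constant_geometry_fft(samples):
    """Radix-2 FFT whose read/write pattern is identical in every stage."""
    n = len(samples)
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("length must be a power of two >= 2")
    # Load the input in bit-reversed order.
    data = [complex(samples[bit_reverse(p, stages)]) for p in range(n)]
    for s in range(stages):
        shift = stages - 1 - s
        nxt = [0j] * n
        for t in range(n // 2):          # t plays the role of BU_ID
            # Only the twiddle exponent depends on the stage index s.
            exponent = (t >> shift) << shift
            w = cmath.exp(-2j * cmath.pi * exponent / n)
            a, b = data[2 * t], w * data[2 * t + 1]
            nxt[t] = a + b               # same destinations in every stage
            nxt[t + n // 2] = a - b
        data = nxt
    return data


def direct_dft(samples):
    """Direct O(N^2) evaluation of the DFT definition, used as a reference."""
    n = len(samples)
    return [sum(samples[m] * cmath.exp(-2j * cmath.pi * m * k / n)
                for m in range(n)) for k in range(n)]


# With 4 BUs, BU0 writes to memory 0 of BU0 and BU2, as described for Fig. 3.
print(destinations(0, 4))
```

Because `destinations` does not take a stage argument, the interconnect configuration stays the same from stage to stage and only the twiddle factors change, which is the property the text credits for the reduced communication and configuration overhead.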
The proposed FFT algorithm has been mapped onto the SmartCell architecture. The complex butterfly operation can be decomposed into 4 real multiplications and 6 real additions. In our design, two PEs are used to calculate the butterfly results in a sequential manner. Two butterfly units can be mapped onto the 4 PEs in the same cell. After the ...

[Fig. 3: data flow example of an 8-point FFT with 4 butterfly units]

[Fig. 4: data transfer patterns of (a) the traditional FFT and (b) the proposed parallel FFT]
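The operation count quoted for the complex butterfly can be made explicit with a small sketch. It assumes the usual radix-2 form y0 = a + w·b and y1 = a − w·b, which the text does not spell out. The function name `butterfly_real` is hypothetical, and the way the work is divided between the two PEs is not modeled.

```python
def butterfly_real(ar, ai, br, bi, wr, wi):
    """Radix-2 butterfly on real and imaginary parts.

    Returns ((y0r, y0i), (y1r, y1i)) for y0 = a + w*b and y1 = a - w*b.
    """
    # Complex product w*b: 4 real multiplications and 2 real additions.
    tr = br * wr - bi * wi
    ti = br * wi + bi * wr
    # Complex sum and difference: 4 more real additions, 6 in total.
    return (ar + tr, ai + ti), (ar - tr, ai - ti)


a, b, w = complex(1.0, 2.0), complex(3.0, -1.0), complex(0.0, -1.0)
y0, y1 = butterfly_real(a.real, a.imag, b.real, b.imag, w.real, w.imag)
print(y0, y1)
```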
III. SMARTCELL SYNTHESIS RESULTS AND BENCHMARK COMPARISONS
A 4 by 4 prototype SmartCell with 64 PEs is designed and synthesized as a standard cell ASIC in a TSMC 90 nm process. A 1K data memory is attached to each PE. This prototype system consists of about 2.0 million gates and can operate at up to 295 MHz. The proposed parallel FFT algorithm is manually mapped onto the prototype system for performance evaluation. FFTs of up to 1024 points can be directly implemented in our current design. Furthermore, SmartCell can be reconfigured to operate on a different FFT size in 90 clock cycles, which is within 1 μs when running at 100 MHz. Table I compares the system throughput among different FFT platforms, including FPGA, multi-core NoC [4] and Morphosys CGRA [7] platforms. Because the operating frequencies differ, the number of clock cycles used to calculate one data block is compared. The NoC [4] and Morphosys [7] implementations fall into the same category of parallel FFT as SmartCell. SmartCell outperforms all other platforms in all benchmark FFTs with respect to system throughput. SmartCell is about 24.1 times faster than the NoC implementation for the 256-pt FFT. On average, SmartCell provides about 2.7x throughput gain compared with the coarse-grained Morphosys implementation. This gain comes mainly from the reduced communication overhead of the proposed parallel FFT algorithm. Compared with pipelined FPGA results, SmartCell achieves a throughput gain ranging from 1.0x to 1.7x. Moreover, due to its unbalanced memory loads and intensive control requirements, the pipelined FFT is not well suited to reconfigurable architectures, especially when the number of processing points needs to be changed on the fly.
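The reconfiguration figure above is a plain cycles-to-time conversion, sketched here for illustration. The helper name `cycles_to_microseconds` is hypothetical. The second call assumes the same 90-cycle count also holds at the maximum reported clock of 295 MHz, which the text does not state.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a cycle count to microseconds at a clock given in MHz."""
    # One MHz is one cycle per microsecond, so the ratio is already in microseconds.
    return cycles / clock_mhz


# 90 cycles at 100 MHz take 0.9 microseconds, within the quoted 1 microsecond bound.
print(cycles_to_microseconds(90, 100))
# Assumed case: the same 90 cycles at the maximum reported clock of 295 MHz.
print(cycles_to_microseconds(90, 295))
```

The same conversion shows why cycle counts are a frequency-independent basis for comparing platforms: dividing by each platform's own clock would mix architectural efficiency with process and clock differences.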
Finally, the energy consumption for one FFT block is compared in Fig. 5 between SmartCell and Xilinx's Virtex II Pro 2VP20 FPGA. Only the core power consumption of the FPGA is recorded for a fair comparison. For the 64-pt FFT, SmartCell consumes 77.3% less energy than the FPGA, since no data memory is used in SmartCell in this case. On average, SmartCell is about 3.6 times more energy efficient than the fine-grained FPGA. Results show that SmartCell consumes 3.1 mW/MHz for the evaluated FFT benchmarks and achieves an average energy efficiency of 20.6 GOPS/W.
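The energy figures above can be related to each other with a few unit conversions. The helper names below are hypothetical, and the interpretation in the comments is a consistency check rather than a result stated in the source; in particular, how operations are counted for the GOPS/W figure is not specified.

```python
def ops_per_cycle(mw_per_mhz, gops_per_watt):
    """Operations per clock cycle implied by the two efficiency figures.

    1 mW/MHz equals 1 nJ per cycle and 1 GOPS/W equals 1 operation per nJ,
    so the product is operations per cycle.
    """
    return mw_per_mhz * gops_per_watt


def energy_ratio_from_saving(saving_fraction):
    """Energy ratio (reference / improved) implied by a fractional saving."""
    return 1.0 / (1.0 - saving_fraction)


# 3.1 nJ per cycle times 20.6 operations per nJ is about 63.9 operations
# per cycle, close to the 64 PEs of the prototype.
print(round(ops_per_cycle(3.1, 20.6), 2))
# Consuming 77.3% less energy corresponds to roughly a 4.4x energy ratio
# for the 64-pt FFT, above the 3.6x average.
print(round(energy_ratio_from_saving(0.773), 2))
```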
IV. CONCLUSIONS
In this paper, a novel two-step parallel FFT algorithm is proposed to provide a balanced workload and a fixed data flow pattern across all processing units. This algorithm is then mapped onto the SmartCell coarse-grained reconfigurable architecture. A prototype SmartCell system with 64 PEs is implemented as a standard cell ASIC in TSMC 90 nm technology. SmartCell dissipates about 3.1 mW/MHz with an energy efficiency of 20.6 GOPS/W. Compared with other FFT implementations, SmartCell is about 14.9 and 2.7 times faster than the parallel FFT results on NoC and Morphosys, respectively. It is also about 3.6 times more energy efficient than FPGA implementations. The results demonstrate that SmartCell is a promising reconfigurable and energy efficient platform for stream processing.
