Abstract-To improve the power figures of a dual ARM9 RISC core architecture targeting low-power digital broadcasting applications, the addition of a coarse-grain reconfigurable architecture is considered. This paper introduces two such structures, PACT's XPP technology and the Montium, developed by the University of Twente, and presents the implementation of a Fast Fourier Transform on 1920 complex samples on both of them. Results in terms of processing time, resource utilization and energy dissipation are described and compared to those we have obtained on the RISC core. The paper concludes with the next steps of the development and some development issues.
I. INTRODUCTION
The DRM (Digital Radio Mondiale) standard [1], [2] proposes the digitization of radio broadcasting in frequency bands below 30 MHz. A System on Chip (SoC) called DiMITRI was designed to show the feasibility of a DRM reception solution and to obtain a first receiver prototype [3]. Analyses showed that most computation power is used in the Coded Orthogonal Frequency Division Multiplexing (COFDM) [4] demodulation to compute Fast Fourier Transforms (FFT) and inverse transforms (IFFT) on complex samples. These FFTs have to be computed on non-power-of-two numbers of samples, which is very uncommon in the signal processing world. These algorithms already exist in software on a 32-bit ARM9 RISC core and our objective is to implement them on a more optimized structure which would reduce their power dissipation with limited impact on the silicon area.
More and more, digital systems demand the combination of high performance and low power dissipation to implement signal processing algorithms. The usual DSP and RISC implementations are very flexible at the expense of power dissipation (for a given algorithm). On the other hand, hardwired structures like ASICs lack flexibility and evolution capabilities but display the best results in terms of performance and power dissipation. Reconfigurable structures claim to bridge the gap between programmable processors and hard-wired structures through an average balance between flexibility and efficiency (which we define as the computing performance divided by the power dissipation). In this paper, the implementation of an 
A. PACT XPP Technology
The eXtreme Processing Platform (XPP) is a run-time coarse-grain reconfigurable architecture based on a 2D array of computing elements, internal memories and a circuit switch-like communication network [5]. The XPP64-A1 chip can be used as a standalone processor, or as a coprocessor next to a microcontroller [6]. Its structure is presented in figure 1. The XPP64-A1 is built from an 8x8 array of 24-bit ALU-PAEs (Arithmetic and Logic Unit - Processing Array Elements) and two rows of 512 24-bit words RAM-PAEs on the sides. In each configuration, a PAE performs one dedicated operation. The array is coupled with a Configuration Manager responsible for the run-time management of configurations. ALUs do not have instruction sequencers and caches, since the operations to be performed are statically configured during the lifetime of a configuration.
The PAE objects are integrated within a network-on-chip, providing point-to-point connections with data handshaking. The dataflow structure implies that an operation is performed as soon as all necessary input values are available and the previous output has been consumed by the downstream operation. The XPP is supported by a dedicated development suite. The architecture is programmed using the low-level Native Mapping Language (NML). For our trials, we have used a board including an XPP64-A1, a microcontroller, some external memories, etc.
Two different algorithms have been used to split up the FFT: the "divide and conquer" approach [9], in particular the radix-2 and radix-4 algorithms [10], and the Prime Factor Algorithm (PFA) [11]. The PFA turns the original transform into sets of small DFTs, the lengths of which have to be co-prime. It makes use of Good's mapping to convert the 1D DFT of length N = N1 N2 into a 2D DFT computed in a row-column fashion. In our case of N = 1920, we have chosen N1 = 128 and N2 = 15 and split up the FFT-1920 into 15 FFT-128 followed by 128 FFT-15. As 128 is a power of two, the FFT-128 can be performed using radix-2 and/or radix-4 algorithms that require O(N log2 N) operations. Figure 3 outlines the implementation of the FFT-1920 using the PFA. The 128 FFT-15 are also computed using the PFA (five FFT-3 followed by three FFT-5), which implies the use of very efficient algorithms for computing both the FFT-3 and the FFT-5 [12]. The same flow as in figure 3
The butterfly structure of the Montium can be used to calculate X(k) and X(15-k) concurrently. They are computed in pairs, using four ALUs. One such pair requires 7 clock cycles and thus 49 clock cycles are needed for all seven pairs. X(0) is calculated in parallel (on the fifth ALU) by adding all the inputs together. In this way, the total number of multiplications can be reduced by a factor of 4 compared to a direct DFT-15. After the computations of the FFT-15, all the results are stored back in the memory in an order that facilitates the FFT-128 computations. A total of 7045 clock cycles is needed for all 128 FFT-15 calculations.
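The benefit of computing X(k) and X(15-k) together can be illustrated with a short sketch. The Python code below is not the Montium mapping; it only shows the underlying symmetry, namely that both outputs share the same cosine and sine partial sums and differ only in the sign of the sine term. The names dft15_paired and direct_dft are assumptions introduced for this illustration.

```python
import cmath
import math

N = 15  # transform length used in the text


def direct_dft(x):
    """Reference DFT by direct summation (no sharing between outputs)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def dft15_paired(x):
    """DFT-15 in which X(k) and X(15-k) are built from shared partial sums.

    Illustrative sketch only: it exploits exp(-j*2*pi*(N-n)*k/N) being the
    complex conjugate of exp(-j*2*pi*n*k/N); it does not model the ALUs.
    """
    assert len(x) == N
    # Sums and differences of the mirrored inputs x(n) and x(15-n), n = 1..7.
    e = [x[n] + x[N - n] for n in range(1, 8)]
    o = [x[n] - x[N - n] for n in range(1, 8)]

    X = [0j] * N
    X[0] = sum(x)  # X(0) is a plain sum of all inputs
    for k in range(1, 8):  # seven pairs (k, 15-k)
        c = sum(e[n - 1] * math.cos(2 * math.pi * n * k / N) for n in range(1, 8))
        s = sum(o[n - 1] * math.sin(2 * math.pi * n * k / N) for n in range(1, 8))
        X[k] = x[0] + c - 1j * s       # X(k)
        X[N - k] = x[0] + c + 1j * s   # X(15-k): same sums, sine sign flipped
    return X
```

In this sketch each pair needs 14 complex-by-real products (28 real multiplications), against 28 complex-by-complex products (112 real multiplications) when the two outputs are evaluated directly. This is consistent with the factor of 4 quoted above, although the operation count of the actual ALU-level mapping may be organized differently.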
Afterwards, the 15 FFT-128 are executed on 15 blocks of data in the memory. Each FFT-128 is computed with a radix-2 algorithm, which also differs from the mixed-radix implementation used on the XPP. The details of the radix-2 FFT mapping to the Montium are shown in [7]. The results of the FFT-128 are stored back in the memory waiting to be read by the CCU. In contrast to the XPP implementation, no external memory is needed. The size of the configuration file for the FFT-1920 is 2.6 kbytes. When the configuration is performed at 100 MHz, it can be loaded in 13 µs.
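The two-stage decomposition of the FFT-1920 into FFT-128 and FFT-15 can be sketched as follows. This is a minimal Python illustration of the Prime Factor Algorithm with Good's input mapping and a Chinese Remainder Theorem output mapping, under the assumption that a naive DFT stands in for the radix-2 FFT-128 and for the FFT-3/FFT-5 kernels. The function names dft and pfa_dft are hypothetical. The sketch runs the length-128 stage first; since the PFA needs no twiddle factors between stages, the two stages can be executed in either order.

```python
import cmath
from math import gcd


def dft(x):
    """Naive DFT, standing in for the optimized short-length kernels."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def pfa_dft(x, n1, n2):
    """DFT of length n1*n2 (n1, n2 co-prime) via the Prime Factor Algorithm."""
    n = n1 * n2
    assert len(x) == n and gcd(n1, n2) == 1

    # Good's input mapping: sample index (n2*i1 + n1*i2) mod n goes to (i1, i2).
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)]
            for i1 in range(n1)]

    # Stage 1: n2 transforms of length n1 (15 FFT-128 for N = 1920).
    stage1 = [dft([grid[i1][i2] for i1 in range(n1)]) for i2 in range(n2)]
    # stage1[i2][k1]

    # Stage 2: n1 transforms of length n2 (128 FFT-15 for N = 1920).
    stage2 = [dft([stage1[i2][k1] for i2 in range(n2)]) for k1 in range(n1)]
    # stage2[k1][k2]

    # CRT output mapping: bin k is found at (k mod n1, k mod n2).
    return [stage2[k % n1][k % n2] for k in range(n)]
```

Calling pfa_dft(x, 128, 15) on a list of 1920 complex samples reproduces the structure described above. No multiplication by twiddle factors appears between the two stages, which is the property that distinguishes the PFA from a mixed-radix split.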
V. RESULTS
Table I
The use of the XPP architecture decreases the calculation time in cycles by a factor of 36 and the energy consumed by a factor of 6. This architecture, originally built for intensive and regular operations (e.g. DCT, video processing), is flexible enough to compute a non-regular FFT such as the FFT-1920. The main drawback is a very large silicon area³, which is not affordable for integration as an IP into the SoC we target⁴. In [7], the power consumption of the Montium, in 0.13 µm technology, is estimated at 0.577 mW/MHz. The results of the FFT-1920 implementation on the Montium are also listed in Table I. These results show a saving of a factor of 35 in terms of processing time and of 14 in terms of power consumption compared to the RISC implementation, and a smaller area. Although its datapath is only 16 bits wide, the Montium architecture seems to be the most promising to decrease the power dissipation and speed up the computations of the COFDM demodulation. These results may be explained by the fact that the micro-sequenced structure of the Montium is more suitable for algorithms that require a lot of local sequencing (e.g. read and write address generators for accessing the RAMs).
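A power figure expressed in mW/MHz can be read as an energy per clock cycle, since 1 mW/MHz equals 1 nJ per cycle. The short Python sketch below applies this conversion to the 0.577 mW/MHz estimate and to the 7045 cycles reported for the 128 FFT-15 calculations. It covers that stage only, because the cycle count of the FFT-128 stage is not given in this section, and the helper name energy_uj is an assumption introduced for illustration.

```python
def energy_uj(cycles, mw_per_mhz):
    """Energy in microjoules for a given cycle count.

    1 mW/MHz = 1e-3 W / 1e6 Hz = 1e-9 J per cycle, i.e. 1 nJ per cycle,
    so the result does not depend on the clock frequency.
    """
    nj_per_cycle = mw_per_mhz
    return cycles * nj_per_cycle / 1000.0  # nJ -> uJ


# Figures quoted in the text.
FFT15_STAGE_CYCLES = 7045    # all 128 FFT-15 calculations
MONTIUM_MW_PER_MHZ = 0.577   # estimate from [7], 0.13 um technology

fft15_stage_energy = energy_uj(FFT15_STAGE_CYCLES, MONTIUM_MW_PER_MHZ)
```

Under these assumptions the FFT-15 stage alone costs about 4.06 µJ. This is an illustrative derivation from the quoted numbers and not a measured result.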
The authors have found no documented ASIC implementation of non-power-of-two FFTs. Reference [15] presents a high-speed FFT-1872 implemented on an FPGA⁶.
VI. CONCLUSION
The evaluation of the coarse-grain reconfigurable architectures has taught us that, as with programmable processors, the choice of a coarse-grain reconfigurable structure must be adapted to the targeted application to get the best performance at the lowest cost (in terms of power consumption and silicon utilization). The XPP processor is well adapted to intensive processing on large sets of data, such as DCT computation or MPEG4 decompression, but a micro-sequenced structure like the Montium looks more promising for processing that is somewhat less intensive but more complex to control. This argument was confirmed by the porting of the FFT-1920, which can be considered a complex computation but does not require the full computation power provided by the XPP architecture.
Within the 4S project [14], our next steps will be the integration of the DRM application on a platform which comprises Montium processors to handle COFDM processing, an ARM9 core and some hard-wired signal processing accelerators.
The development time has not been taken into account in our experiments. However, the effort to port algorithms to coarse-grain reconfigurable structures is considerable when using the ad-hoc low-level languages (NML, pseudo-assembler, etc.). The availability and the efficiency of compilers to quickly port algorithms described in C (or some other high-level language) will be a key issue for the adoption of these structures in industry.
³ Many ALU PAEs are actually used for local micro-sequencers required by the FFT-1920 algorithm.
⁴ Future versions of the XPP structure are, however, planned to improve the power and silicon utilization figures.
⁵ Power figures on ARM9 and XPP do not include the external RAMs.
⁶ This implementation favors computation speed at the expense of silicon occupation. Further comparisons are difficult since we deliberately favored flexible solutions able to compute 18 types of FFT on common hardware.
