Abstract-In this paper, we describe a processor architecture tailored to the mixed-radix-4/2/3 FFT algorithm. The proposed design supports all FFT sizes required by LTE applications, namely the power-of-two sizes 128-2048 and the size 1536. The processor is based on the Transport Triggered Architecture (TTA), customized with a set of function units designed specifically for the application at hand. The processor has been synthesized on an ASIC technology, and both energy efficiency and performance have been evaluated. The developed processor is programmable but shows energy efficiency comparable to fixed-function ASIC implementations.
I. INTRODUCTION
Interest in efficient implementation of the Discrete Fourier Transform (DFT) started in 1965 with the famous Fast Fourier Transform (FFT) algorithm [1]. After almost half a century, this interest remains very high due to the fundamental, useful properties of the DFT. The recent boost of interest is due to communication applications, in particular Long Term Evolution (LTE) and Software Defined Radio (SDR), e.g., [2], [3], [4], [5], [6], [7], [8], [9], [10]. In these applications, very efficient implementations of the DFT are needed to satisfy extremely tight, mutually contradicting constraints, such as hard real-time requirements on top of low-power, low-cost, and flexible hardware platforms.
In LTE, DFTs of a series of OFDM symbols must be computed [11] at a rate of one symbol per 66.67 µs. Each symbol is a vector of complex numbers of length N, where N may take one of the following values: N = 128, 256, 512, 1024, 1536, or 2048 [11]. At the same time, the design must be low-power and low-cost to be useful, since the main target devices are portable consumer electronics such as mobile phones (smartphones), laptops, etc. On the other hand, business models require flexible, programmable implementations. An important use case is SDR, where software implementations of several radios, one of them typically being LTE, should be supported on top of a shared hardware platform [12]. Therefore, in SDR, an even wider range of FFT sizes needs to be supported under even tighter requirements.
There is a vast number of different FFT implementations, e.g., [2]-[10], to mention only a few of the most recent publications related to communication applications. In particular, mixed-radix-4/2 [2]-[7] and mixed-radix-4/2/3 [8] variable-length FFT implementations were proposed. Most of the publications propose either special-purpose (fixed or reconfigurable) FFT hardware architectures [4]-[8] or software FFT implementations on existing processor architectures [9]-[10]. Conventionally, hardware implementations are thought to provide better time and power performance but poor flexibility, while software implementations are thought to provide high flexibility but poor performance in terms of execution time and power consumption.
In this work, we propose a new customized Transport Triggered Architecture (TTA) based processor for programmable implementation of mixed-radix-4/2/3 FFTs of sizes $N = 4^m 2^n 3^l$. Thus, in particular, FFTs of all the sizes needed in LTE, 128, 256, 512, 1024, 1536, and 2048, are supported. Compared to the solutions of [2], [3], the proposed architecture achieves not only higher flexibility, by supporting sizes that are multiples of 3, but also better performance, obtained by optimizing the previous designs. In particular, the modified functional units for twiddle factor generation and operand address generation feature shorter critical paths, which allowed the processor to be synthesized for higher clock frequencies.
II. FFT ALGORITHM
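The DFT of an input vector $x = [x_0, x_1, \ldots, x_{N-1}]^T$ is defined as the vector $y = [y_0, y_1, \ldots, y_{N-1}]^T$ such that:
$$y_k = \sum_{i=0}^{N-1} x_i W_N^{ik}, \quad k = 0, 1, \ldots, N-1, \qquad (1)$$
or equivalently:
$$y = F_N x, \qquad (2)$$
where $F_N$ is the $(N \times N)$-matrix of DFT with entries:
$$[F_N]_{k,i} = W_N^{ik}, \quad W_N = \exp(-j 2\pi / N). \qquad (3)$$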
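As a reference point for the fast algorithm, the definition can be evaluated directly. The following is a minimal sketch, assuming the unnormalized DFT with $W_N = \exp(-j2\pi/N)$; the function name `dft` is illustrative and not part of the described design.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of y[k] = sum_i x[i] * W_N^(i*k)."""
    n = len(x)
    return [
        # W_N^(i*k) is periodic in N, so the exponent is reduced modulo N.
        sum(x[i] * cmath.exp(-2j * cmath.pi * ((i * k) % n) / n) for i in range(n))
        for k in range(n)
    ]

impulse_spectrum = dft([1, 0, 0, 0])    # a unit impulse has a flat spectrum
constant_spectrum = dft([1, 1, 1, 1])   # a constant input has only a DC term
```

Such a direct evaluation needs on the order of $N^2$ complex multiplications, which is what the staged FFT formulation avoids.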
Many FFT algorithms were developed for efficient computation of the DFT. In this paper, we use the in-place, decimation-in-time (DIT), mixed-radix-4/2/3 algorithm with permuted input and in-order output. The formula for our FFT algorithm of size $N = 4^m 2^n 3^l$, where $m \in \mathbb{N}_0$, $n \in \mathbb{N}_0$, and $l \in \{0, 1\}$, is given as follows:
The following notations were used in this formula:
where
An identity matrix of order $N$ is denoted by $I_N$, and $\otimes$ denotes a tensor product. The matrices $T$, holding twiddle factors for the corresponding radix-4, radix-2, and radix-3 computation stages, are obtained with:
where $\oplus$ denotes a matrix direct sum. Finally, $R_{4^m 2^n 3^l}$ is an input permutation matrix based on the stride-by-$S$ permutation matrices $P_N^S$ of order $N$ and is given as
With the formula (4), a DFT of an input vector of size $N = 4^m 2^n 3^l$ can be computed in $m + n + l$ stages. An example for a 24-point FFT, where $N = 24 = 4^1 \cdot 2^1 \cdot 3^1$, is shown in Fig. 1.
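The stage count can be illustrated with a short sketch that splits a size into its radix-4, radix-2, and radix-3 exponents. The helper `radix_stages` is hypothetical, and preferring radix-4 stages over radix-2 stages is an assumption made here; it agrees with the 24-point example and with the five radix-4 stages plus one radix-2 stage quoted for the 2048-point FFT in Section V.

```python
def radix_stages(size):
    """Return (m, n, l) with size == 4**m * 2**n * 3**l and l in {0, 1}.

    Assumption: as many radix-4 stages as possible are used, so n is 0 or 1.
    """
    l = 0
    if size % 3 == 0:          # at most one radix-3 stage
        size //= 3
        l = 1
    m = 0
    while size % 4 == 0:       # radix-4 stages
        size //= 4
        m += 1
    n = 0
    while size % 2 == 0:       # remaining radix-2 stage(s)
        size //= 2
        n += 1
    if size != 1:
        raise ValueError("size is not of the form 4^m * 2^n * 3^l with l in {0, 1}")
    return m, n, l

# Number of stages (m + n + l) for every LTE size named in the text.
lte_stage_counts = {s: sum(radix_stages(s)) for s in (128, 256, 512, 1024, 1536, 2048)}
```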
Each stage consists of two substages. In the first substage, the vector of intermediate results is multiplied by a diagonal matrix. This, in fact, means multiplying the intermediate results by the so-called twiddle factors (powers of $W_N = \exp(-j 2\pi / N)$). In the second substage, the vector is multiplied by a sparse matrix. Each sparse matrix is presented as a Kronecker product and may, by row and column permutations, be transformed into a block-diagonal matrix with $N/r$ blocks being DFTs of size $r$, where $r = 4$ for the first $m$ stages, $r = 2$ for the next $n$ stages, and $r = 3$ for the last $l$ stages ($l = 0, 1$). Multiplying a vector by such a matrix means performing $N/r$ so-called radix-$r$ butterfly operations.
Properties of the employed representation (4) offer certain benefits compared to other FFT algorithms, and we took advantage of them while implementing our processor. In particular, an in-place algorithm ensures that the size of the data memory can be limited to the maximum size of the FFT that the processor supports. This is especially important in embedded applications. The chosen order of the different radix stages, as well as the use of the in-order output rather than the in-order input algorithm, simplified some of the functional units of the processor. The operand address generator and the twiddle factor generators, described in more detail in Section IV, are built from fewer gates and perform less switching activity compared to [2], [3]. This should decrease both static and dynamic power consumption.
III. TRANSPORT TRIGGERED ARCHITECTURE
The proposed processor is based on the Transport Triggered Architecture (TTA) [13] template. The TTA falls into the category of statically programmed instruction-level parallelism (ILP) architectures. It belongs to the class of exposed data path, Very Long Instruction Word (VLIW) processor architectures, where the details of the data path transfers are disclosed to the software designer. This brings two benefits: it allows unique optimizations in the code, and it allows customization of the data path interconnection network.
TTA is a modular design. Its data path is built of a set of register files (RF), functional units (FU), a control unit, and an interconnection network (IC) between all those resources. The programming model of the TTA differs from that of general-purpose processor architectures, e.g., RISC: the program specifies data moves on the IC. Data are written into the input ports and read out from the output ports of the FUs. Each unit has a single triggering input port, which triggers the operation of that unit whenever data is written into it. Therefore, the operations of the processor can be seen as a "side effect" of data transports.
If the inputs of the FU are registered, a set of operations can be performed on the same set of input data by triggering the unit with different opcodes. Sharing the operands between different operations of the FU reduces data traffic over the IC and the need for temporary storage in the RFs or data memory.
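The trigger-port behaviour and the operand sharing described above can be mimicked with a toy software model. This is only an illustrative sketch: the class `AddSubUnit`, its port names, and the opcode strings are hypothetical and do not correspond to any FU of the proposed processor.

```python
class AddSubUnit:
    """Toy model of an FU with one registered operand port and one trigger port."""

    def __init__(self):
        self.operand = 0   # registered input: keeps its value between operations
        self.result = 0    # output port

    def write_operand(self, value):
        # A plain data transport: no operation is started.
        self.operand = value

    def write_trigger(self, value, opcode):
        # Writing the trigger port starts the operation selected by the opcode.
        if opcode == "add":
            self.result = self.operand + value
        elif opcode == "sub":
            self.result = self.operand - value
        else:
            raise ValueError(f"unknown opcode: {opcode}")

fu = AddSubUnit()
fu.write_operand(10)          # transported once ...
fu.write_trigger(3, "add")    # ... and reused by two different operations
total = fu.result
fu.write_trigger(3, "sub")
difference = fu.result
```

In this model the operand 10 crosses the interconnect once, although both the sum and the difference are computed from it, which is the kind of saving a butterfly computation can exploit.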
The modular nature of the TTA can be exploited to reduce power consumption. The processor structure can be tailored specifically to fit the application it is designed for. Removing unnecessary resources, e.g., FUs and connection sockets from the IC, reduces the static power consumption due to the reduced gate count. Additional power savings can be obtained with special function units (SFU). These are custom-designed FUs, either implementing an algorithm-specific operation in an optimized way or performing a set of operations that would normally require the use of several general-purpose FUs. With SFUs, the number of necessary data moves, as well as the number of program instructions, can be reduced. The resulting reduction in switching and instruction fetches can significantly decrease power consumption.
Speed or area optimization can be obtained by adding resources to, or removing them from, the processor. One can add more resources to improve performance by exploiting ILP. Leaving only the necessary resources reduces the silicon area of the processor to a minimum. A trade-off between speed and area can be worked out according to, e.g., the requirements. In this work, we have used several custom-designed units tailored for the FFT application, which provide a trade-off between execution time, area, and power consumption of the processor.
IV. PROCESSOR ORGANIZATION
Our processor implements the in-place, decimation-in-time (DIT), in-order output, mixed-radix-4/2/3 FFT algorithm. It supports power-of-two FFT sizes in the range of 128-2048 and, additionally, transform sizes that are multiples of three, $3 \cdot 2^n$. In particular, the 1536-point FFT defined in the LTE specification is supported.
The general organization of the proposed TTA processor is presented in Fig. 2. The processor is composed of 12 general-purpose and special FUs and a total of 16 RFs containing 25 32-bit general-purpose registers and 6 Boolean registers. The FUs and RFs are connected by an IC consisting of 25 buses. The number of connecting sockets has been hand-optimized down to 111. In addition, the processor has a control unit, an instruction memory, and a 32-bit, dual-ported data memory. The size of the data memory is limited by the size of the maximum FFT to be supported by the processor, in our case 2048 + 1 words. The last word holds the size of the FFT to be computed.
In principle, for a given FFT size, the order of radix stages in mixed-radix computations can be chosen arbitrarily. However, careful investigation revealed that placing the radix-3 stage at the end is the most beneficial from the implementation point of view. The ternary number system is natural for radix-3 computations. In the binary system, ternary digit-level operations are not straightforward, which makes radix-3 implementations complex and hence more power demanding than radix-2 and radix-4 algorithms. However, if we decompose a 1536-point FFT into three 512-point FFTs followed by a single radix-3 stage, the operand address and twiddle factor generators for the radix-3 stage can be implemented with binary bit-level operations.
The in-order output FFT algorithm implemented in our design implies the need for a permutation of the input samples, so that the correct results are produced. In the case of the power-of-two FFTs, only the permutation is required. For the 1536-point FFT, an initial decomposition into three sets, each comprising every 3rd sample, is also needed. These operations should be taken into account when integrating our design into a system with a source of samples and a sink for results. Integration details remain beyond the scope of this paper.
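The decomposition into three sub-transforms followed by one radix-3 stage can be checked numerically. The sketch below is an illustration under stated assumptions, not the processor's program: the sub-transforms are computed with a direct DFT as a stand-in for the radix-4/2 stages, the function names are hypothetical, and a small size is used in place of 1536.

```python
import cmath

def dft(x):
    """Direct DFT, used both as the reference and as the sub-transform stand-in."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def dft_radix3_last(x):
    """DFT of length N = 3*M from three length-M DFTs plus one radix-3 stage."""
    n = len(x)
    if n % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    m = n // 3
    # Input decomposition: set r holds samples r, r + 3, r + 6, ...
    subs = [dft(x[r::3]) for r in range(3)]
    w3 = cmath.exp(-2j * cmath.pi / 3)
    y = [0j] * n
    for k in range(m):
        # Twiddle-factor substage: multiply set r by W_N^(r*k).
        a = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / n) for r in range(3)]
        # Butterfly substage: a 3-point DFT across the three sets.
        for q in range(3):
            y[k + q * m] = sum(a[r] * w3 ** (r * q) for r in range(3))
    return y

samples = [complex(i, -i * i) for i in range(12)]   # arbitrary test data, N = 3 * 4
max_error = max(abs(a - b) for a, b in zip(dft(samples), dft_radix3_last(samples)))
```

The outputs appear in natural order, while the inputs are consumed as three interleaved sets, which mirrors the permuted-input, in-order-output arrangement described in the text.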
The processor takes complex values as input. A 32-bit word has been adopted to represent the complex numbers, where the 16 most significant bits hold the real part and the 16 least significant bits hold the imaginary part. Appropriate arithmetic special function units were designed to perform the complex multiplication and addition required by the radix-2 and radix-4 butterflies.
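The word layout can be sketched as follows. The source states only which half of the word holds which part; treating each half as a signed two's-complement 16-bit integer is an assumption made for this illustration, and the function names are hypothetical.

```python
def pack_complex(re, im):
    """Pack two signed 16-bit integers into one 32-bit word (real part in the MSBs)."""
    for v in (re, im):
        if not -32768 <= v <= 32767:
            raise ValueError("component does not fit in 16 bits")
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def _signed16(v):
    """Interpret a 16-bit field as a two's-complement value (assumed encoding)."""
    return v - 0x10000 if v & 0x8000 else v

def unpack_complex(word):
    """Split a 32-bit word back into its (real, imaginary) parts."""
    return _signed16((word >> 16) & 0xFFFF), _signed16(word & 0xFFFF)

word = pack_complex(1, -1)
parts = unpack_complex(word)
```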
Indexing of the input data for every stage of the FFT computations can be represented by a permutation matrix [14] . However, instead of using generic arithmetic units to calculate the indexes, one can perform index manipulations on the bit-level. This proved to be a low-power solution and was implemented as two address generator (AG) SFUs, a combined one for radix-4 and radix-2 stages, and a separate one for the radix-3 stage.
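To illustrate why bit-level index manipulation can replace arithmetic, consider a stride-by-2 gather of $N = 2^b$ elements. This is not the address generator of the proposed processor, only a small example under an assumed convention (output position $j$ reads input position $2j \bmod (N-1)$, with the last element fixed); both function names are hypothetical.

```python
def stride2_by_arithmetic(j, n):
    """Source index for output position j, computed with multiply and modulo."""
    return j * 2 % (n - 1) if j < n - 1 else n - 1

def stride2_by_bits(j, bits):
    """The same index, obtained by rotating the bits-wide index left by one bit."""
    mask = (1 << bits) - 1
    return ((j << 1) | (j >> (bits - 1))) & mask

# Gather order for N = 8: even-indexed inputs first, then odd-indexed ones.
order = [stride2_by_bits(j, 3) for j in range(8)]
```

The rotation needs no multiplier or divider, only rewiring of index bits, which is why such generators can be built from few gates.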
Another low-power solution is the twiddle factor (TF) generator. In principle, the generator exploits the redundancy among TFs. Instead of storing all the required TFs in a look-up table, only one eighth of them plus one, i.e., N/8 + 1 values, are kept in the memory, while the rest are generated by modifying the stored ones. There are two TF generator SFUs, one for the radix-4 and radix-2 stages, and one for the radix-3 stage.
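One way to exploit this redundancy is the octant symmetry of sine and cosine. The sketch below assumes that the stored values cover the first octant of the unit circle and that the remaining twiddle factors are obtained by swapping and negating the stored parts; the source does not detail the actual generator, and the names `make_table` and `twiddle` are hypothetical.

```python
import math

def make_table(n):
    """Store (cos, sin) for the first octant only: n/8 + 1 entries."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n // 8 + 1)]

def twiddle(k, n, table):
    """Rebuild W_n^k = exp(-2j*pi*k/n) from the first-octant table."""
    quadrant, r = divmod(k % n, n // 4)
    if r <= n // 8:
        c, s = table[r]
    else:
        s, c = table[n // 4 - r]   # mirror about 45 degrees: cos and sin swap
    # Multiplying (c - j*s) by (-j)**quadrant needs only swaps and sign changes.
    return [complex(c, -s), complex(-s, -c), complex(-c, s), complex(s, c)][quadrant]

n = 64
table = make_table(n)
worst = max(
    abs(twiddle(k, n, table)
        - complex(math.cos(2 * math.pi * k / n), -math.sin(2 * math.pi * k / n)))
    for k in range(n)
)
```

For a 64-point transform the table holds 9 entries instead of 64, and no multiplication is needed to derive the others.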
The Radix-3 Butterfly Unit implements the algorithm for FFT radix-3 butterfly computations as in [15]. Fig. 3 presents the signal flow graph of the computations. Compared to the classic radix-3 butterfly implementation, the number of arithmetic operations has been reduced from six complex multiplications to two complex multiplications and two real multiplications by a constant. The number of complex additions is six in both cases.
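A 3-point DFT with six complex additions and two scalings by real constants can be sketched as below. This is a common formulation and is not claimed to be identical to the signal flow graph of [15]; it assumes that the two complex multiplications are the twiddle-factor multiplications applied to `x1` and `x2` before the butterfly, which the source does not spell out.

```python
import math

HALF = 0.5
SIN_60 = math.sqrt(3) / 2   # sin(2*pi/3)

def radix3_butterfly(x0, x1, x2):
    """3-point DFT using six additions and two real-constant scalings."""
    t1 = x1 + x2                      # addition 1
    t2 = x1 - x2                      # addition 2
    y0 = x0 + t1                      # addition 3
    m = x0 - HALF * t1                # addition 4, scaling by 1/2
    r = SIN_60 * t2                   # scaling by sqrt(3)/2
    jr = complex(-r.imag, r.real)     # multiply by j: swap parts, negate one
    y1 = m - jr                       # addition 5
    y2 = m + jr                       # addition 6
    return y0, y1, y2

# Check against the direct definition with W_3 = exp(-2j*pi/3).
w3 = complex(-HALF, -SIN_60)
a, b, c = complex(1, 2), complex(-3, 0.5), complex(0.25, -4)
expected = (a + b + c, a + w3 * b + w3 * w3 * c, a + w3 * w3 * b + w3 * c)
butterfly_error = max(abs(u - v) for u, v in zip(radix3_butterfly(a, b, c), expected))
```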
V. PERFORMANCE ANALYSIS
In order to evaluate our processor, we implemented all SFUs described in Section IV in hand-written VHDL. The structural description of the processor core was obtained with PRODE, a VHDL description generator from the TCE. The obtained results are compared in Table I to the design in [2]. As can be seen, the proposed design is less than half the size of the previous design, while it supports the FFT of size 1536 in addition to FFTs of power-of-two sizes. At the same time, the proposed design can use a clock frequency about 1.6x higher, and it uses approximately the same number of clock cycles as the previous design for FFTs of equal sizes. Thus, faster FFT implementations on a smaller device are achieved.
The processor operates in a pipelined fashion, i.e., in one clock cycle, different FUs process different samples from the same computational set. This is possible due to the TTA, which exposes data transfers on the processor buses to the software designer. The processor does not have a fixed pipeline, typical of other architectures; rather, the programmer (or compiler) can create a pipeline by scheduling data transfers in a particular manner. Hand-optimized assembly code guarantees a throughput of 1 sample per clock cycle after a latency of up to 57 cycles, needed to fill the pipeline. The latency increases with the FFT size, reaching the maximum of 57 cycles for the 2048-point FFT. For instance, at the theoretical throughput of one sample per clock cycle, a mixed-radix-4/2 2048-point FFT with five radix-4 stages and one radix-2 stage takes 2048·6 = 12 288 clock cycles. Our processor requires only 57 cycles more, which is the maximum latency for FFT computations with this design.
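The cycle count quoted above can be turned into a rough clock requirement. The sketch below uses only numbers stated in this paper (2048 points, 6 stages, 57 cycles of pipeline fill, 66.67 µs per symbol); the resulting frequency is a derived lower bound under the assumption that one FFT must complete within one symbol period, not a reported synthesis result.

```python
def fft_cycles(size, stages, fill_latency):
    """One sample per clock cycle in every stage, plus the pipeline fill latency."""
    return size * stages + fill_latency

SYMBOL_TIME_S = 66.67e-6                      # LTE symbol period from Section I

cycles_2048 = fft_cycles(2048, 6, 57)         # 12 288 + 57 cycles
min_clock_hz = cycles_2048 / SYMBOL_TIME_S    # lowest clock that meets the deadline
```

Under these assumptions, the 2048-point case needs a clock of roughly 185 MHz to finish within one symbol period.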
The energy efficiency of the proposed architecture is compared to other FFT processors by measuring how many 1024-point FFTs can be computed with an energy of 1 mJ. The comparison results are listed in Table II. They show that the energy efficiency of the proposed processor is on the level of fixed-function ASIC implementations, although the implementation is programmable. It should also be noted that the proposed processor is the only one that supports radix-3 computations.
VI. CONCLUSIONS
In this paper, an application-specific processor customized for FFT computations is proposed. The design, based on a TTA processor template, targets in particular the FFT sizes required by LTE applications. A set of hand-optimized special function units was designed. The processor supports power-of-two FFT sizes in the range 128-2048 and the 1536-point FFT. The latter is obtained with the added support for radix-3 FFT computations.
The processor was synthesized on a 130 nm ASIC technology. Timing, area, and power analysis showed that the proposed design features both low energy consumption and higher performance. The proposed design is less than half the size of the previous design, it supports the 1536-point FFT, and it can be clocked at an approximately 1.6 times higher maximum frequency. The hand-optimized assembly code allows a throughput of one sample per clock cycle after a short latency.
Future work will include adding radix-5 support, as well as expanding the range of computable FFT sizes to all power-of-three sizes. Power-efficiency analysis and optimization will also be carried out.
