A 128-point Multi-Path SC FFT Architecture by Hsu, Shun-Che et al.
  
Shun-Che Hsu & Shen-Jui Huang  
Intelligo Technology Inc. 
Hsinchu City, Taiwan 
shenray.ee95g@g2.nctu.edu.tw 
Sau-Gee Chen & Shin-Che Lin 
Institute of Electronics  
National Chiao Tung University  
Hsinchu City, Taiwan             
sgchen@g2.nctu.edu.tw 
Mario Garrido 
Department of Electrical Engineering  
Universidad Politécnica de Madrid  
Madrid, Spain 
mario.garrido@upm.es 
   
 
Abstract—This paper presents a new radix-2^k multi-path 
FFT architecture, named MSC FFT, which is based on the 
single-path radix-2 serial commutator (SC) FFT architecture. 
The proposed multi-path architecture has a very high 
hardware utilization, which results in a small chip area while 
providing high throughput. In addition, the adoption of 
radix-2^k FFT algorithms simplifies the rotators even 
further, which is achieved by optimizing the structure of the 
processing element (PE). The implemented architecture is a 
128-point 4-parallel multi-path SC FFT in a 90 nm process. 
Its area and power consumption at 250 MHz are only 0.167 
mm^2 and 14.81 mW, respectively. Compared with existing 
works, the proposed design significantly reduces the chip area 
and the power consumption, while providing high throughput.  
Keywords—SC FFT, radix-2^k FFT, multi-path structure 
 
I.  INTRODUCTION  
 
Fast Fourier transform (FFT) is widely used in advanced 
communication systems, particularly in those that adopt the 
transmission techniques of orthogonal frequency division 
multiplexing (OFDM) and single-carrier frequency division 
multiple access (SC-FDMA). Examples are the most 
advanced mobile system, R15 5G NR, as well as IEEE 802.11ax 
and 802.11ay. These systems demand a throughput an order of 
magnitude higher than their predecessors, which leads to the 
need for high-performance FFTs, specifically parallel FFT 
architectures.  
Over the last decades, many pipelined FFT hardware 
architectures have been proposed: single-path delay 
feedback (SDF) [1], which processes serial data and includes 
feedback loops; multi-path delay feedback (MDF) [2-10], 
which processes parallel data and includes feedback loops; 
single-path delay commutator (SDC) [11-12], which 
processes serial data without feedback loops; multi-path 
delay commutator (MDC) [13-15], which processes parallel 
data without feedback loops; and the recently proposed 
single-path serial commutator (SC) FFT architecture [16], 
which makes use of circuits for bit-dimension permutation of 
serial data. The SC FFT architecture has the advantage of 
eliminating the low utilization problem of SDF FFTs thanks 
to a novel data management scheme that reorders butterfly 
and rotation operations. As a result, the number of adders 
and multipliers in the processing element (PE) of each stage 
is reduced by half.  
Although the SC FFT is very hardware-efficient, it 
considers only a single serial input and output data stream.  
 
Fig. 1.  Processing element of the SC FFT. 
 
Consequently, it is hard to meet the throughput 
requirements of the advanced communication systems 
mentioned above. In this paper, we propose a new parallel 
multi-path FFT architecture that integrates parallel copies of 
the SC FFT architecture and thoroughly optimizes the 
resulting multi-path architecture. As a result, high hardware 
utilization, high area efficiency, low power consumption and 
high throughput are achieved simultaneously by the proposed 
MSC FFT architecture. As a demonstration of the proposed 
design concept, we design a 128-point 4-parallel MSC FFT 
architecture that exploits high-radix FFT algorithms, i.e., 
the radix-2^3 and radix-2^4 FFT algorithms, to optimize the 
result. Moreover, we incorporate the rotator allocation 
techniques in [17] in order to increase the hardware 
utilization even further.  
This paper is organized as follows. In Section II, we 
describe the background of this work. In Section III, the 
proposed MSC FFT architecture is presented. In Section IV, 
experimental results and comparison with previous works are 
provided, followed by Section V, the conclusion.  
 
II. BACKGROUND 
 
A. The fast Fourier transform  
 
The N-point discrete Fourier transform (DFT) for an 
input sequence, x[n], is defined as  
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1, \tag{1}$$
where X[k] is the kth DFT coefficient and $W_N^{nk} = e^{-j 2\pi nk/N}$. 
Fig. 2. Signal flow graphs for different radices. (a) Radix-2^2. (b) Radix-2^3. 
(c) Radix-2^4.  

Fast Fourier transform (FFT) algorithms are applied to 
reduce the computational cost of the DFT, leading to a 
complexity of O(N log_2 N), which is much less than the O(N^2) 
of the direct computation of the DFT. During the FFT process, 
the major basic computation involved is the multiplication of 
a data sample by the so-called twiddle factors 
$W_L^m = e^{-j 2\pi m/L}$, where m and L are positive 
integers. This operation is equivalent to a vector rotation by 
an angle of $2\pi m/L$.  
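
As an illustration of the definitions above, the following Python sketch evaluates Eq. (1) directly and compares it with a recursive radix-2 DIF FFT. It is not the hardware algorithm of this paper, only a minimal software reference; the function names `twiddle`, `dft` and `fft_dif` are hypothetical and the input length is assumed to be a power of two.

```python
import cmath


def twiddle(m, L):
    """Twiddle factor W_L^m = exp(-j*2*pi*m/L)."""
    return cmath.exp(-2j * cmath.pi * m / L)


def dft(x):
    """Direct evaluation of Eq. (1); needs on the order of N^2 operations."""
    N = len(x)
    return [sum(x[n] * twiddle(k * n, N) for n in range(N)) for k in range(N)]


def fft_dif(x):
    """Recursive radix-2 DIF FFT; len(x) is assumed to be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Butterfly: the sums feed the even outputs without rotation,
    # the differences are rotated by W_N^n and feed the odd outputs.
    upper = [x[n] + x[n + half] for n in range(half)]
    lower = [(x[n] - x[n + half]) * twiddle(n, N) for n in range(half)]
    out = [0j] * N
    out[0::2] = fft_dif(upper)
    out[1::2] = fft_dif(lower)
    return out
```

Note that only the lower (difference) branch of each butterfly is multiplied by a twiddle factor, which is the property of DIF flow graphs that the simplified PEs rely on.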
 
B. The SC FFT architecture 
  
    Fig. 1 shows the processing element of the SC FFT in 
[16]. It consists of a half-butterfly, a half-rotator, and 
input and output circuits for data reordering. Note that data 
arrive serially at a rate of one sample per clock cycle; the 
two parallel paths in the figure carry the real and imaginary 
parts of the same sample.  
 
C. Symmetric angle sets 
 
A symmetric angle set (SAS) [17,18] is a set of angles of 
the form nπ/2 ±α, where n = 0,…,3 and α∈[0, π/4]. Any 
rotation in a symmetric angle set can be calculated as a 
rotation by an angle α∈[0, π/4], a trivial rotation and an 
exchange of the real and imaginary part of the rotation 
coefficient.  
 
D. M-rotator 
 
An M-rotator or M-rot [18] is a rotator that can rotate by 
angles in M different symmetric angle sets. For instance, a 
rotator that rotates by 0º, 45º and 135º is a 2-rot, as it 
rotates by angles in the symmetric angle sets nπ/2 and nπ/2 ± 
π/4. Likewise, the twiddle factor W_8 is a 2-rot, the twiddle 
factor W_16 is a 3-rot and the twiddle factor W_32 is a 5-rot. 
In general, a twiddle factor W_L is an (L/8+1)-rot. This means 
that there are L/8+1 angles in the range [0, π/4] and the rest 
of the angles can be obtained by utilizing the symmetry property. 
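
The symmetry argument can be checked numerically. The following Python sketch, given under the assumption that L is a multiple of 8, maps every exponent k of W_L^k to a trivial rotation (-j)^n, an angle index in [0, L/8] and a flag that exchanges the real and imaginary parts of the coefficient. The helper names `sas_decompose`, `twiddle_from_sas` and `num_sas` are hypothetical.

```python
import math

TRIVIAL = (1, -1j, -1, 1j)  # (-j)^n for n = 0..3


def sas_decompose(k, L):
    """Return (n, a, swapped) with alpha = 2*pi*a/L in [0, pi/4]."""
    q = L // 4                      # number of exponents per quadrant
    n, r = divmod(k % L, q)
    if r <= q // 2:
        return n, r, False          # angle n*pi/2 + alpha
    return n, q - r, True           # angle (n+1)*pi/2 - alpha


def twiddle_from_sas(k, L):
    """Rebuild W_L^k from cos/sin of an angle in [0, pi/4] only."""
    n, a, swapped = sas_decompose(k, L)
    alpha = 2 * math.pi * a / L
    c, s = math.cos(alpha), math.sin(alpha)
    if swapped:
        c, s = s, c                 # exchange real and imaginary parts
    return TRIVIAL[n] * complex(c, -s)


def num_sas(L):
    """Number of distinct angles in [0, pi/4] needed by W_L."""
    return len({sas_decompose(k, L)[1] for k in range(L)})
```

For L = 8, 16 and 32, `num_sas` returns 2, 3 and 5, in agreement with the L/8+1 rule stated above.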
 
Fig. 3.  PE with W_4 rotator (PE_W4). 

Fig. 4.  PE with W_8 rotator (PE_W8). 

Fig. 5.  PE with W_16 rotator (PE_W16). The W16_MUL block outputs 473x, 362x or 196x for S = 0, 1 or 2, respectively. 
 
 
III. PROPOSED MULTI-PATH SC FFT ARCHITECTURE 
  
A. Simplification of the PE for radix-2^k 

Fig. 2 shows the flow graphs of the radix-2^2, radix-2^3 and 
radix-2^4 FFTs, which are based on the decimation-in-frequency 
(DIF) decomposition. The proposed architecture is based on 
these building blocks. One can observe that the rotations in 
the flow graphs only appear at the lower edges of the 
butterfly outputs. This means that only the lowest path of 
the PE in the SC FFT needs to be rotated. As a result, the 
PEs in Figs. 3 to 6 are derived.   
Fig. 3 shows the realization of the PE for the stages 
where a W_4 rotation is applied. In this case, only a -j 
rotation is needed. The S1_2 signal selects a rotation by 1 
or -j. Fig. 4 shows the realization of the PE for the stages 
where a W_8 rotation is applied. In this case, the constant 
0.707 in W_8^1 = 0.707 - 0.707j and W_8^3 = -0.707 - 0.707j is 
moved to the input of the rotator. As a result, the rotator 
only requires one constant multiplier. With a proper setting 
of the control signals, this PE calculates all the rotations 
of the W_8 twiddle factor. The constant multiplication by 
0.707 can be implemented as four simple shift-and-add 
operations for 12-bit precision.  
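
One decomposition that meets this operation count is sketched below in Python. It is an assumption for illustration and not necessarily the circuit of Fig. 4: the constant is quantized to 181/256 = 0.70703125, whose binary representation 10110101 has five nonzero bits and therefore needs four additions. The function name `mul_0707` is hypothetical.

```python
def mul_0707(x):
    """Multiply the integer x by 181/256 (about 0.707) with shifts and four additions."""
    # 181 = 128 + 32 + 16 + 4 + 1, i.e. five shifted copies of x
    acc = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x
    return acc >> 8  # divide by 256, rounding toward minus infinity
```

The quantization error of 181/256 with respect to 1/sqrt(2) is about 7.6e-5, which is below 2^-12.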
Fig. 6.  PE with general complex rotator (PE_TW). 
 
Fig. 5 shows the implementation of the PE for the stages 
where a W_16 rotation is required. By using the symmetries 
of the unit circle, the rotator only has 3 SAS. Therefore, it 
only needs to rotate by W_16^0, W_16^1 and W_16^2, whereas the 
rest of the rotations are obtained by symmetry. If we 
consider that W_16^1 = (473-196j)/512 and W_16^2 = 
(362-362j)/512, the rotator only needs to multiply its inputs 
by 473/512, 362/512 or 196/512. This allows the resources to 
be shared among the multiplications by all these values. With 
a proper setting of the control signals, W16_MUL can 
calculate the multiplication by 473, 362 or 196 using only 
three real adders.  
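
A three-adder decomposition consistent with the intermediate products labelled in Fig. 5 (5x, 10x, 40x, 160x, 4x, 64x, 256x and 512x) is sketched below in Python. The exact wiring of the multiplexers is an assumption; the sketch only shows that 473 = 512 - 40 + 1, 362 = 512 - 160 + 10 and 196 = 256 - 64 + 4 can share one adder for 5x. The names `w16_mul` and `w16_coefficients` are hypothetical.

```python
import math


def w16_mul(x, s):
    """Return 473*x, 362*x or 196*x for s = 0, 1, 2 using three additions."""
    x5 = x + (x << 2)                          # adder 1: 5x, used for s = 0 and s = 1
    a = ((x << 9), (x << 9), (x << 8))[s]      # 512x, 512x, 256x
    b = ((x5 << 3), (x5 << 5), (x << 6))[s]    # 40x, 160x, 64x
    c = (x, (x5 << 1), (x << 2))[s]            # x, 10x, 4x
    return (a - b) + c                         # adders 2 and 3


def w16_coefficients():
    """Coefficients of W_16^1 and W_16^2 scaled by 512 and rounded."""
    return (round(512 * math.cos(math.pi / 8)),
            round(512 * math.cos(math.pi / 4)),
            round(512 * math.sin(math.pi / 8)))
```

The three tuples play the role of the 3-input multiplexers controlled by S, so that only one subtraction and one addition follow the shared 5x adder.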
In the final stage of the radix-2^k flow graphs in Fig. 2, 
rotations occur at both the upper and lower output paths of 
each butterfly. Therefore, a complex multiplication is 
required in these stages. Fig. 6 shows the corresponding PE.  
 
B. Multi-path SC FFT Architecture 
 
Fig. 7 shows the proposed 128-point 4-parallel MSC FFT 
architecture. The FFT is realized with a combination of the 
radix-2^3 and radix-2^4 algorithms, as described in Section 
III-A. As a result, general complex rotators (PE_TW) are only 
needed at stage 3 of the architecture. The rest of the stages 
use the simpler PEs described in Section III-A. 
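TABLE I.  ROTATOR DISTRIBUTION AT STAGES 4 TO 6 OF THE PROPOSED MSC FFT 

|        | Stage 4        | Stage 5 | Stage 6 |
|--------|----------------|---------|---------|
| Path 1 | W_16^0, W_16^4 | W_8^0   | W_4^0   |
| Path 2 | W_16^1, W_16^5 | W_8^1   | W_4^0   |
| Path 3 | W_16^2, W_16^6 | W_8^2   | W_4^0   |
| Path 4 | W_16^3, W_16^7 | W_8^3   | W_4^1   |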
 
In addition to combining radix-2^3 and radix-2^4, the 
proposed MSC FFT applies the ideas of rotator allocation [17] 
to simplify the rotators even further. Table I shows the 
rotations at the parallel paths of stages 4, 5 and 6. One can 
observe that the upper path of stage 4 only has to calculate 
rotations by W_16^0 and W_16^4, which are trivial. Hence, a 
PE_W4 is enough here, which greatly simplifies the rotator. 
Likewise, other rotators at stages 4, 5 and 6 are simplified 
by following the same ideas. These rotators correspond to the 
shaded blue modules in Fig. 7.  
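
The allocation can be reproduced from Table I with a short Python sketch. The selection rule used here is an assumption drawn from the observation above: each path receives the smallest rotator W_L' that contains all of its twiddle factors, and rotations that reduce to 1 or -j are mapped to PE_W4 for simplicity. The names `TABLE_I`, `smallest_rotator` and `allocate` are hypothetical.

```python
from fractions import Fraction

# Twiddle exponents per path (paths 1 to 4), copied from Table I as (L, [k, ...]).
TABLE_I = {
    "Stage 4": [(16, [0, 4]), (16, [1, 5]), (16, [2, 6]), (16, [3, 7])],
    "Stage 5": [(8, [0]), (8, [1]), (8, [2]), (8, [3])],
    "Stage 6": [(4, [0]), (4, [0]), (4, [0]), (4, [1])],
}


def smallest_rotator(L, exponents):
    """Smallest PE whose rotator covers every W_L^k in the given set."""
    # k/L in lowest terms has denominator L' when W_L^k = W_L'^k'.
    # All denominators are powers of two, so the maximum covers the whole set.
    need = max(Fraction(k, L).denominator for k in exponents)
    return "PE_W4" if need <= 4 else f"PE_W{need}"


def allocate(table=TABLE_I):
    """Map every stage to the list of PEs required by paths 1 to 4."""
    return {stage: [smallest_rotator(L, ks) for L, ks in paths]
            for stage, paths in table.items()}
```

Under these assumptions, stage 4 needs a PE_W16 only in paths 2 and 4, while path 1 is served by a PE_W4 and path 3 by a PE_W8.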
 
Fig. 7. Proposed 4-parallel radix-2^3/2^4 128-point MSC FFT architecture. 
 
 
IV. EXPERIMENTAL RESULTS AND COMPARISON 
 
To verify its performance, the proposed MSC FFT 
architecture has been synthesized using the TSMC 90 nm 
technology. Ten thousand random input patterns of 1 or -1 are 
applied to the FFT architecture. The experimental results are 
shown in the second column of Table II. The architecture 
works at a frequency of 250 MHz, leading to a throughput of 1 
GS/s due to the 4 parallel paths. The word length used in the 
architecture is 12 bits. This results in an SQNR of 42.7 dB. The 
pre-layout gate-level synthesis results show that the area of 
the whole FFT architecture is 0.167 mm^2. Finally, the power 
consumption is 14.81 mW at the clock rate of 250 MHz. 
Columns 3 to 7 in Table II show the synthesis results of the 
compared 128-point multi-path FFT architectures. All the 
compared designs are based on either MDC or MDF FFT 
architectures, which are derived from their single-path 
predecessors, namely SDC and SDF. In contrast, the proposed 
MSC FFT architecture is the first parallel FFT architecture 
derived and optimized from the highly efficient single-path 
SC FFT.  
TABLE II.  COMPARISON OF 128-POINT MULTI-PATH FFT ARCHITECTURES 

| | This work | [5] | [8] | [9] | [10] | [15] |
|---|---|---|---|---|---|---|
| FFT architecture | 4-parallel MSC | 4-parallel MDF | 4-parallel MRMDF | 4-parallel MDF | 8-parallel MDF | 2-parallel MDC |
| FFT algorithm | radix-2^3/2^4 | radix-2^4 | radix-8/2 | radix-2^4 | radix-2^3/2^4 | radix-2 |
| FFT size | 128 | 128 | 128/64 | 128 | 128 | 128 |
| Tech. (nm) | 90 @1.0V | 180 @1.8V | 180 @1.8V | 180 @1.8V | 90 @1V | 65 @0.7V |
| Word length | 12 bits | 12 bits | 10 bits | 10 bits | 8 bits | 12 bits |
| Working frequency | 250 MHz | 200 MHz | 250 MHz | 450 MHz | 260 MHz | 160 MHz |
| A (mm^2) | 0.167 | - | 3.097 (excl. test module) | - | 0.53 (excl. test module) | 0.34 |
| Gate count | 59049 | 93200 | - | 130000 | | - |
| P (mW) | 14.81 @1 GS/s | 132 @800 MS/s | 175 @1 GS/s | - | 6.8 @409.6 MS/s | 2.12 @317.25 MS/s |
| Normalized area A_N (mm^2) | 0.167 (26%) | - | 0.774 (100%) | - | 0.53 (68%) | 0.652 (84%) |
| Normalized power P_N (mW) | 6.75 @440 MS/s (56%) | 11.2* @440 MS/s (94%) | 11.97 @440 MS/s (100%) | - | 7.3 @440 MS/s (61%) | 8.31 @440 MS/s (69%) |
| SQNR | 42.7 dB | 40 dB | - | 33 dB | 26.4 dB | - |
| Throughput rate (R: clock rate) | 4R | 4R | 4R | 4R | 8R | 2R |
| Experimental results | Post-synthesis | Post-synthesis | Post-implementation | Post-synthesis | Post-implementation | Post-synthesis |

    *: This value has been normalized by multiplying the working frequency rate. 

Together with the proposed architecture, the works of [5], 
[8] and [9] are all 4-parallel architectures. For a proper area 
comparison, we have normalized the area figures of all the 
compared designs according to the following commonly 
adopted formula, where the proposed design is used as the 
benchmark design: 
$$A_N = \frac{A}{(\mathrm{Tech.}/90\,\mathrm{nm})^2}, \tag{2}$$
where A_N is the normalized area, A is the area, and Tech. is 
the process feature size of the compared design. For the 
proposed approach, the process size is 90 nm. 
It is observed that the proposed design requires significantly 
less area than the compared 4-parallel designs. Compared to 
[5], the area of the proposed design is 61% smaller in terms of 
gate count. With respect to [8], the proposed design requires 
around one-fourth of the area, while compared to [9], the 
proposed design needs less than half of the gate count. 
 
To compare the power consumption, similarly the power 
data are normalized with respect to the proposed design by 
using the following formula: 
$$P_N = \frac{P}{(\mathrm{Tech.}/90\,\mathrm{nm}) \times (V/1.0)^2}, \tag{3}$$
where the number 1.0 in the formula is the operating voltage 
of the proposed implementation, and V, P and P_N are the 
operating voltage, power consumption and normalized power 
consumption of the compared design, respectively.  
At 440 MS/s, the proposed approach consumes only 60% 
and 56% of the total power consumption in [5] and [8], 
respectively. This represents savings of 40% or larger in 
power consumption.  
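
The normalizations of Eqs. (2) and (3) can be sketched in Python as follows. The linear scaling of the power to a common throughput of 440 MS/s is an assumption inferred from the entries of Table II, and the function and parameter names are hypothetical.

```python
def normalized_area(area_mm2, tech_nm, ref_nm=90.0):
    """Eq. (2): scale the area by the squared ratio of feature sizes."""
    return area_mm2 / (tech_nm / ref_nm) ** 2


def normalized_power(power_mw, tech_nm, vdd, rate_msps,
                     ref_nm=90.0, ref_vdd=1.0, ref_rate_msps=440.0):
    """Eq. (3), followed by an assumed linear scaling to the reference throughput."""
    p_n = power_mw / ((tech_nm / ref_nm) * (vdd / ref_vdd) ** 2)
    return p_n * ref_rate_msps / rate_msps
```

For example, with the data of [15] in Table II (0.34 mm^2 and 2.12 mW at 317.25 MS/s in a 65 nm process at 0.7 V), the functions return approximately 0.652 mm^2 and 8.31 mW, which match the normalized entries of the table.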
 
 
Finally, the SQNR is calculated as explained in [20]. By 
comparing the results in the table, it can be observed that the 
proposed approach achieves the highest SQNR among all the 
128-point multi-path FFT designs. 
 
V. CONCLUSIONS 
 
In this work, we have presented the first multi-path SC FFT 
(MSC FFT). The demonstration design is a 128-point 4-parallel 
radix-2^3/2^4 MSC FFT. The use of an SC-based processing 
element facilitates the minimization of the hardware resources 
in the architecture. Furthermore, the utilization of radix-2^3/2^4 
FFT combined with optimized rotator allocation results in 
highly hardware-efficient rotator architectures. Compared with 
the existing designs with the same FFT size and degree of 
parallelism, the proposed architecture achieves savings of 
more than 40% in both area and power consumption, while 
providing very high throughput. The proposed architecture can 
be extended to a parallelism higher than four, with much higher 
throughput, in order to meet the requirements of the most 
demanding current and future communication systems (with 
throughput of more than 10 Gbps), such as 5G mobile systems, 
IEEE 802.11ax and IEEE 802.11ay systems. Those application 
designs are currently under investigation.  
 
REFERENCES 
 
[1] S. He and M. Torkelson, “Design and implementation of a 1024-point 
pipeline FFT processor,” in Proc. IEEE Custom Integrated Circuits 
Conf., May 1998, pp. 131-134. 
[2] N. Li and N. P. van der Meijs, “A radix-2^2 based parallel pipeline FFT 
processor for MB-OFDM UWB system,” in Proc. IEEE Int. SOC Conf., 
2009, pp. 383–386. 
[3] J. Lee, H. Lee, S. I. Cho, and S.-S. Choi, “A high-speed, low-complexity 
radix-2^4 FFT processor for MB-OFDM UWB systems,” in Proc. IEEE 
Int. Symp. Circuits Syst., 2006, pp. 210–213. 
[4] H. Liu and H. Lee, “A high performance four-parallel 128/64-point 
radix-2^4 FFT/IFFT processor for MIMO-OFDM systems,” in Proc. 
IEEE Asia–Pacific Conf. Circuits Syst., Nov. 2008, pp. 834–837. 
[5] S.-I. Cho, K.-M. Kang, and S.-S. Choi, “Implementation of 128-point fast 
Fourier transform processor for UWB systems,” in Proc. Int. Wirel. 
Commun. Mobile Comput. Conf., 2008, pp. 210–213. 
[6] C.-H. Yang, T.-H. Yu, and D. Markovic, “Power and area minimization 
of reconfigurable FFT processors: A 3GPP-LTE example,” IEEE J. 
Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012. 
[7] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, “A 2.4-GS/s FFT processor for 
OFDM-based WPAN applications,” IEEE Trans. Circuits Syst. I, 
Regular Papers, vol. 57, no. 6, pp. 451–455, Jun. 2010. 
[8] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A 1-GS/s FFT/IFFT processor for 
UWB applications,” IEEE J. of Solid-State Circuits, vol. 40, pp. 1726 – 
1735, Aug 2005. 
[9] Minhyeok Shin and Hanho Lee, “A high-speed four-parallel 
radix-2^4 FFT/IFFT processor for UWB applications,” in Proc. IEEE Int. 
Symp. Circuits Syst., Seattle, WA, 2008, pp. 960-963. 
[10] S. Tang, J. Tsai and T. Chang, “A 2.4-GS/s FFT Processor for 
OFDM-Based WPAN Applications,” in IEEE Transactions on Circuits 
and Systems II: Express Briefs, vol. 57, no. 6, pp. 451-455, June 2010. 
[11] X. Liu, F. Yu, and Z. ke Wang, “A pipelined architecture for normal I/O 
order FFT,” Journal of Zhejiang University - Science C, vol. 12, no. 1, 
pp. 76-82, Jan, 2011. 
[12] Y.-N. Chang, “An efficient VLSI architecture for normal I/O order 
pipeline FFT design,” IEEE Trans. Circuit Syst. II: Express Briefs, vol. 
55, no. 12, pp. 1234-1238, Dec. 2008. 
[13] M. Garrido, J. Grajal, M.A. Sanchez, and O. Gustafsson, “Pipelined 
radix-2^k feedforward FFT architectures,” IEEE Trans. VLSI, vol. 21, 
issue. 1, pp. 23-32, 2013. 
[14] M. A. Sánchez, M. Garrido, M. L. López, and J. Grajal, “Implementing 
FFT-based digital channelized receivers on FPGA platforms,” IEEE 
Trans. Aerosp. Electron. Syst., vol. 44, no. 4, pp. 1567–1585, Oct. 2008. 
[15] T. Ahmed, “A low-power time-interleaved 128-point FFT for IEEE 
802.15.3c standard,” in Proc. Int. Conf. Informatics, Electronics Vision, 
Dhaka, 2013, pp. 1-5. 
[16] Mario Garrido, Shen-Jui Huang, Sau-Gee Chen and Oscar Gustafsson, 
“The Serial Commutator (SC) FFT,” IEEE Trans. Circuits Syst. II: 
Express Briefs, Vol. 63, No. 10, pp. 974-978, Oct. 2016. 
[17] Mario Garrido, Shen-Jui Huang and Sau-Gee Chen, “Feedforward FFT 
hardware architectures based on rotator allocation,” IEEE Trans. 
Circuits Syst. I: Regular Papers, Vol. 65, No. 2, pp. 581-592, Feb. 2018. 
[18] R. Andersson. “FFT hardware architectures with reduced twiddle factor 
sets,” Master’s thesis. Dpt. Of Electrical Engineering, Linköping 
University, Jun. 2014. 
[19] M. Garrido, J. Grajal and O. Gustafsson. “Optimum circuits for bit- 
dimension permutations,” IEEE Trans. on VLSI, Vol 27, No. 5, pp. 
1148-1160, May 2019. 
[20] D. Guinart and M. Garrido, “SQNR in FFT hardware architectures,” 
Under review.
 
