A Small-Area High-Performance 512-Point
2-Dimensional FFT Single-Chip Processor
Naoto Miyamoto*, Leo Karnan*, Kazuyuki Maruo†, Koji Kotani* and Tadahiro Ohmi‡
*University of Tohoku, Graduate School of Engineering, 10 Aoba, Aramaki, Aoba, Sendai, Japan, 980-8579
†Advantest Laboratories Ltd., 48-2 Matsubara, Kamiayashi, Aoba, Sendai, Japan
‡University of Tohoku, New Industry Creation Hatchery Center, 10 Aoba, Aramaki, Aoba, Sendai, Japan
Phone: +81-22-217-3977   FAX: +81-22-217-3986   miyamoto@fff.niche.tohoku.ac.jp
Figure 1. System block diagram (labeled blocks: main memory of
512-word SRAMs; cache memory of #1 and #2 registers, which change
alternately; radix-2³ computation element built from butterflies)
Figure 2. Operation sequence for each register (READ, EXE and
WRITE phases of the #1 and #2 registers during a radix-2ⁿ
computation, shown for balanced and imbalanced cases). It is
required to keep a balance between the memory access cycle and the
execution cycle to achieve 100% efficiency
(cache fill & flush time < radix-2ⁿ processing time).
Abstract - We have designed an FFT processor based on the
2-stage cached-memory architecture. It integrates 552,000
transistors within an area of 2.8 x 2.8 mm² in a 0.35µm
triple-layer-metal CMOS process. The processor executes a
512-point 1-dimensional FFT on 36-bit complex fixed-point data
in 23.2 µsec, and a 2-dimensional FFT in only 23.8 msec, at
133MHz operation. The measured power consumption is 439.6mW
at 3.3V, 100MHz operation.
I. Introduction
The fast Fourier transform (FFT) transforms a set of data
from the time domain to the frequency domain or vice versa.
Since it saves a substantial amount of computation over direct
evaluation of the discrete Fourier transform (DFT) [1], the FFT
is widely used in signal processing. Early implementations of
FFT algorithms were mainly software running on general-purpose
computers [2]. With increasing demand for high-speed signal
processing, however, efficient hardware implementation of FFT
processors has become key to developing future generations of
advanced multi-dimensional digital signal and image processing
systems.
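To give a sense of scale for the savings mentioned above, the sketch below counts complex multiplications for a direct DFT and for a radix-2 FFT using the textbook formulas N² and (N/2)·log₂N. The function names are illustrative, and the FFT count ignores trivial twiddle factors, so it is an upper bound rather than a figure taken from this design.

```python
def dft_complex_mults(n):
    """Direct DFT: each of the n outputs is a sum of n products."""
    return n * n


def radix2_fft_complex_mults(n):
    """Radix-2 FFT: log2(n) passes of n/2 butterflies, one multiply each."""
    passes = n.bit_length() - 1          # log2(n) for a power of two
    if n < 2 or (1 << passes) != n:
        raise ValueError("n must be a power of two")
    return (n // 2) * passes


if __name__ == "__main__":
    n = 512
    print("direct DFT :", dft_complex_mults(n))
    print("radix-2 FFT:", radix2_fft_complex_mults(n))
```

For N = 512 the two counts differ by roughly two orders of magnitude.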
This paper describes a small-area, high-performance
512-point FFT processor that uses both the cached-memory
architecture (CMA) and the double buffer structure. The
CMA significantly reduces the hardware resources of the
computation element compared with the single-memory or
dual-memory architecture, even when the radix is high. By
using the 2-stage CMA based on the double buffer structure,
a high-performance multi-dimensional FFT is achieved.
II. Architectural Features
A. The Cached-Memory Architecture
Figure 1 shows the system block diagram using the CMA.
It is similar to the single-memory architecture except that
a small cache memory consisting of a pair of registers
resides between the computation element and the main
memory. The pair of registers hides the memory access cycle
behind the computation cycle, as shown in Fig. 2. A tightly
coupled processor-cache pair increases the effective bandwidth
to the main memory: once 2ⁿ words of data have been fetched
into the cache, the processor need not access the main memory
during the radix-2ⁿ execution.
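The two effects described here can be put into a rough model. The sketch below is a simplification under stated assumptions: the function names and all cycle counts are hypothetical, and fill and flush are assumed to be serialized on a single memory port. It counts how many passes over main memory a transform needs when 2ⁿ-word groups are cached, and estimates how busy the computation element stays with one or two registers.

```python
def main_memory_passes(num_points, group_bits):
    """Passes over main memory when each cached group covers
    `group_bits` radix-2 stages (group size 2**group_bits words)."""
    stages = num_points.bit_length() - 1           # log2(num_points)
    if num_points < 2 or (1 << stages) != num_points:
        raise ValueError("num_points must be a power of two")
    return -(-stages // group_bits)                # ceiling division


def steady_state_efficiency(exe_cycles, fill_cycles, flush_cycles, registers=2):
    """Fraction of time the computation element is busy in steady state."""
    mem_cycles = fill_cycles + flush_cycles        # assumed: one memory port
    if registers == 1:
        period = exe_cycles + mem_cycles           # access and execution alternate
    else:
        period = max(exe_cycles, mem_cycles)       # access overlaps execution
    return exe_cycles / period


if __name__ == "__main__":
    print(main_memory_passes(512, 1), main_memory_passes(512, 3))
    print(steady_state_efficiency(8, 3, 3, registers=1))
    print(steady_state_efficiency(8, 3, 3, registers=2))
```

In this model the register pair reaches full utilization only when fill plus flush time does not exceed the execution time, which is the balance condition of Fig. 2.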
Figure 3. Dataflow diagram of the 8-word group: (a) normal 8-word
computation based on radix-2 decomposition (1st, 2nd and 3rd passes
of radix-2 butterflies); (b) resource-saving multi-datapath radix-2³
computation element (legend: signal address, interconnect switch,
memory bus)
B. Radix-2³ Computation Element
In the radix-2 FFT algorithm, an 8-word transformation
requires a 4-row x 3-column array of butterfly operations, as
shown in Fig. 3(a). Since the butterfly execution unit,
composed of a complex multiplier, an adder and a subtractor, is
large, it is very important to reuse the computation resources
in order to shrink the chip. To achieve this, we have developed
the resource-saving multi-datapath radix-2³ computation element
(RM-R2³CE) shown in Fig. 4.
Each column is composed of 4 independent butterflies, and
data flows in a consistent pattern in the 8-word groups
throughout the entire 512-word transform. This symmetry and
regularity make it possible to divide the normal 8-word
computation into 3 passes, as shown in Fig. 3(b). The passes
are processed cyclically by a single RM-R2³CE consisting of
4 radix-2 butterfly units. With the RM-R2³CE, all of the
communication is concentrated in the intermediate step, and
each signal is sorted and stored into the register closest to
the following butterfly unit by changing the address of the
data stored in the cache one after another. In this manner, the
RM-R2³CE reduces the processor area to one third and the
interconnect area to one half without any delay penalty.
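The three-pass reuse of four butterfly units can be illustrated in software. The sketch below is not the chip's datapath: it assumes a decimation-in-time ordering with bit-reversed input, models the routing switches as plain index arithmetic, and uses hypothetical function names. It runs the same four butterflies three times over an 8-word register; `direct_dft` is included only as a reference for checking the result.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 butterfly: one complex multiply, one add, one subtract."""
    t = w * b
    return a + t, a - t


def radix8_three_pass(x):
    """8-point DFT computed as 3 passes over the same 4 butterfly units."""
    if len(x) != 8:
        raise ValueError("expected an 8-word group")
    # Load the register in bit-reversed order (decimation in time).
    reg = [x[int(format(i, "03b")[::-1], 2)] for i in range(8)]
    for p in range(3):
        half = 1 << p                          # distance between butterfly inputs
        nxt = list(reg)
        for unit in range(4):                  # the 4 butterfly units of one pass
            start = (unit // half) * 2 * half
            k = unit % half
            i, j = start + k, start + k + half
            w = cmath.exp(-2j * cmath.pi * k / (2 * half))
            nxt[i], nxt[j] = butterfly(reg[i], reg[j], w)
        reg = nxt                              # results feed the next pass
    return reg


def direct_dft(x):
    """Reference O(n^2) DFT used only to check the result."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [complex(i, (i * i) % 5) for i in range(8)]
    err = max(abs(a - b) for a, b in zip(radix8_three_pass(data), direct_dft(data)))
    print("max error vs direct DFT:", err)
```

Since 512 = 8³, a full 512-point transform can be organized as three sweeps of such 8-word groups, with additional twiddle factors between sweeps; that outer loop is omitted here.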
Proceedings of the 2004 Asia and South Pacific Design Automation Conference (ASP-DAC’04) 
0-7803-8175-0/04 $ 20.00 IEEE
Figure 5. Block diagram and operation sequence of the
2-stage CMA using the double buffer structure (block diagram:
off-chip memory, level-2 cache of #1 and #2 single-port SRAMs,
level-1 cache of #1 and #2 registers, macro processor built from
butterflies; operation sequence for 512-point 1-dimensional FFTs:
in the normal CMA, overhead exists between consecutive FFTs; in
the 2-stage CMA, memory access time is hidden behind the
execution time)
Figure 4. Block diagram of the resource-saving
multi-datapath radix-2³ computation element (RM-R2³CE)
(labeled blocks: four radix-2 butterfly units; #1 and #2 8-word
registers with paths from and to SRAM; register select;
resource-saving routing switches; a MUX with inputs labeled
"1st, 3rd passes" and "2nd pass")
C. 2-Stage Cached-Memory Architecture
The double buffer structure is employed to execute the FFT
and the data transfer with off-chip memory simultaneously, as
shown in Fig. 5. No overhead exists between consecutive
1-dimensional FFTs because the memory access time is hidden
behind the execution time. In the original CMA without the
double buffer structure, by contrast, the processor must stall
and wait for data. This matters for a high-speed, small-area
2-dimensional FFT processor, in which a series of
1-dimensional FFTs must be executed.
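How a 2-dimensional FFT reduces to a series of 1-dimensional FFTs can be sketched with the usual row-column method. The text above does not spell out the decomposition, so row-column ordering is an assumption here, and `fft` and `fft2` are illustrative names for a plain recursive radix-2 routine rather than a model of the hardware.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(image):
    """2-D FFT of an N x N array as N row FFTs followed by N column FFTs."""
    rows = [fft(row) for row in image]                  # N 1-D transforms
    cols = [fft(list(col)) for col in zip(*rows)]       # N more 1-D transforms
    return [list(r) for r in zip(*cols)]                # transpose back


if __name__ == "__main__":
    n = 4
    img = [[complex(r * n + c, r - c) for c in range(n)] for r in range(n)]
    for row in fft2(img):
        print([f"{v.real:.2f}{v.imag:+.2f}j" for v in row])
```

An N x N transform therefore needs 2N back-to-back 1-dimensional FFTs, which is why any gap between consecutive transforms accumulates.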
Dual-port SRAM is sometimes used in pipeline FFT processors
to increase the datapath width. Since a dual-port SRAM
occupies about twice the area of a single-port SRAM, it is
difficult to adopt the double buffer structure with it.
However, the CMA with the RM-R2³CE can achieve 100%
efficiency even with a single-port SRAM configuration. This
suggests that the combination of the CMA, the RM-R2³CE and
the double buffer structure is extremely well suited to the
2-stage CMA.
III. Chip Design and Conclusions 
Figure 6. Die photo of the proposed FFT processor (labeled
blocks: #1 Data SRAM, #2 Data SRAM, Coefficient SRAM;
technology: CMOS 0.35µm, 3-layer-metal; layout area:
2.8 x 2.8 mm²; core: 2.8mm square)
Figure 7. Power consumption vs. 1-Dim. FFT exec. time
(log-log plot; vertical axis: power consumption [mW]; horizontal
axis: 1-Dim. 512-point FFT execution time [µsec]; a line of
constant Power x Time is drawn. Plotted: E. Bidet, M. Wosnitza
(2-dimensional FFT convolution), Spiffee @ 3.3V, Cobra, Cordic
FFT and the proposed processor. Points are annotated with
(datapath width [bit], chip size [mm²], technology [µm]):
(16, 36, 0.6 GaAs), (23, 127.61, 0.75), (20, 49.1, 0.7(poly=0.6)),
(10, 100, 0.5), (36x8, 7.84, 0.35), (32, 167, 0.5).)
Figure 6 shows the die photo of the specially designed
2-stage CMA FFT processor using the RM-R2³CE. This
processor has been designed with 0.35µm, 3-layer-metal CMOS
technology and occupies an area of only 7.84mm².
Table I shows the key features of the 2-stage CMA FFT
processor in comparison with previous works [3],[4].
The proposed FFT processor is superior in datapath width,
2-dimensional FFT calculation speed, and especially silicon
area, even when the difference in technology is taken into
account. In this table, we assume constant throughput in the
performance estimation. We have measured this FFT processor
and confirmed its correct behavior. The power consumption of
the proposed processor is 439.6mW at 3.3V, 100MHz, which is
comparable to that of the other processors [3]-[6], which are
dedicated to 1-dimensional FFT, and smaller than that of [7],
which targets 2-dimensional convolution, as shown in Fig. 7.

TABLE I  Key features of the proposed FFT processor

                     Proposed              Spiffee [3]           NTT LSI Lab. [4]
Technology           0.35µm                0.7µm (Lpoly=0.6µm)   0.8µm
Number of            Logic : 192k          460k                  380k
Transistors          SRAM  : 360k
                     Total : 552k
Datapath Width       36x8-bit fixed point  20-bit fixed point    24-bit floating point
Area                 7.84 mm²              49 mm²                134 mm²
Clock Freq.          133MHz                173MHz                40MHz
512-point FFT        23.2µsec              14µsec (estimated)    62µsec (estimated)
512x512-point        23.8msec              30msec (estimated)    >63msec (estimated)
2-D FFT (FSB=66MHz)
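The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 512x512-point 2-dimensional FFT is carried out as 512 row plus 512 column 1-dimensional FFTs with no gaps between them; that assumption and the helper names are illustrative, not taken from the paper.

```python
def cycles(time_s, clock_hz):
    """Clock cycles elapsed in `time_s` seconds at `clock_hz`."""
    return time_s * clock_hz


def row_column_time(n, t_1d_s):
    """Time for an n x n 2-D FFT done as n row plus n column
    1-D FFTs with no idle time between them (an assumption)."""
    return 2 * n * t_1d_s


T_1D = 23.2e-6      # reported 512-point FFT time at 133 MHz, in seconds
CLOCK = 133e6       # reported clock frequency, in Hz

if __name__ == "__main__":
    print("cycles per 512-point FFT:", round(cycles(T_1D, CLOCK)))
    print("2-D FFT time [msec]     :", row_column_time(512, T_1D) * 1e3)
    print("die area [mm^2]         :", 2.8 * 2.8)
    print("transistors [k]         :", 192 + 360)
```

Under that assumption, 1024 transforms of 23.2 µsec take about 23.76 msec, which rounds to the reported 23.8 msec; the area and transistor totals likewise match the 7.84 mm² and 552k entries.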
Acknowledgements
The VLSI chip in this study has been designed and fabricated
through the chip fabrication program of the VLSI Design and
Education Center (VDEC), the University of Tokyo, in
collaboration with Synopsys and Cadence (CAD tools) and with
Rohm Corporation and Toppan Printing Corporation.
References
[1] W. Cooley, et al., Vol. 9, pp. 292-301, 1965.
[2] Tom Chen, et al., IEEE Trans. on VLSI Systems, Vol. 7, No. 2, pp. 174-182, 1999.
[3] Bevan M. Baas, JSSC, Vol. 34, No. 3, pp. 380-387, 1999.
[4] H. Miyanaga, et al., ICASSP, Vol. 2, pp. 1193-1196, 1991.
[5] E. Bidet, et al., JSSC, Vol. 30, No. 3, pp. 300-305, 1995.
[6] R. Sarmiento, et al., IEEE Trans. on VLSI Systems, Vol. 6, No. 1, pp. 18-30, 1998.
[7] M. Wosnitza, et al., IEEE International Solid State Circuits Conference, Vol. 41, pp. 118-119, 1998.