An in-place processor design for real-value FFTs targeting in-situ dynamic ADC test by Mullane, Brendan & O'Brien, Vincent
  
Abstract — This paper presents a processor architecture for 
Fast Fourier Transform computation of real-valued signals for 
on-chip analog to digital converter test and evaluation. The 
design implements a radix-2 technique optimized for low area 
overhead and easy integration into systems on chip. The 
hardware logic supports variable transform lengths and accurate 
parameter extraction. The processor has been validated on 
0.18µm CMOS silicon and applied to a data converter test 
application for extraction of the dynamic parameters 
SINAD, SFDR and THD. The architecture is suitable for 
safety-critical applications where spectral integrity checks of the 
converter signal path can be run at start-up or during interval down times. 
 
Index Terms— Fast Fourier Transform (FFT), in-place,  
CORDIC, Built-In-Self-Test (BIST), Real Value Fast Fourier 
Transform (RFFT), Analog to Digital Converter (ADC). 
 
I. INTRODUCTION 
Real-value Fast Fourier Transforms (RFFTs) find use in 
many application specific ICs such as biomedical and 
industrial systems designs [1] whereby analog signals are 
converted and processed in the digital domain via an Analog 
to Digital Converter (ADC) component. These front-end 
physical signals are mostly real and after FFT transformation 
exhibit conjugate symmetry, which gives rise to redundancies 
that can be exploited in logic design [2]. ADC Built-in-Self-
Test (BIST) is an application area where RFFTs can be used to 
monitor dynamic spectral performance in-situ on-chip. The 
purpose of this paper is to explore a programmable 
architecture that trades off speed and area for minimal 
complexity, enabling a versatile design for mixed-signal 
System on-Chips (SoCs) capable of accurate ADC dynamic 
test calculations. The architecture is well suited to moderate 
speed, low-area and low-power on-chip applications. SoC data 
converter test techniques typically use DSP-based testing of 
analog and mixed-signal circuits [3], whereby an on-chip 
processor is used to compute the converter’s static and 
dynamic performance parameters. The ADC spectral 
performance is evaluated by performing an FFT on the sampled 
results of a coherent sinewave applied to the input of the 
ADC. Other ADC BIST techniques also exist which rely on 
signal processing calculations that provide static and dynamic 
assessment of the converter device under test [4, 5].  
 
In this paper, an in-place memory-based CPU architecture 
is proposed that computes the RFFT based on a radix-2 
decimation in time algorithm. The architecture avoids 
pipelining and cache memory requirements, featuring 
application specific operations that help with the FFT and 
ADC test execution. Complex datapath operations are also 
avoided, while simple conflict-free memory addressing 
techniques are employed to ease logic overhead. The key 
contribution is an in-place programmable CPU with FFT 
support that operates on real-valued data and uses less logic. The 
solution enables accurate ADC dynamic parameter extraction 
and spectral analysis for in-situ chip applications requiring 
high reliability. The RFFT operations and architectural aspects 
are presented in Section II. Section III covers the processor 
implementation, and the test chip design results are discussed 
in Section IV. Finally, conclusions are drawn in Section V. 
II. RFFT OPERATION AND ARCHITECTURAL ASPECTS 
The typical ADC test flow for spectral test and evaluation 
involves sinewave application/storage, time domain to 
frequency conversion and parameter extraction. A simple but 
inefficient way to transform a time-discrete periodic signal 
from the time domain to the frequency domain is the Discrete Fourier 
Transform (DFT); however, a DFT requires N^2 complex 
multiplications for an N-point DFT. A significant reduction in 
computation time and resources can be achieved by employing 
a Fast Fourier Transform (FFT). The FFT decomposes the 
process into successively smaller DFT computations, with an 
N-point FFT reduced to multiple 2-point DFTs (radix-2 
technique). The number of complex multiplications is 
significantly lowered, to the order of N·log2(N) operations. 
 
The ADC output is real-valued which makes the RFFT 
algorithm [6] applicable to BIST. The radix-2 decimation in 
time RFFT removes redundant operations that occur due to the 
real only input at every FFT stage and hence memory 
requirements and computation processes are reduced 
approximately by a factor of 2. An N-point real input of the 
RFFT results in a complex output with N/2 real and N/2 
imaginary points generated. Figure 1 shows the signal flow for 
an 8-point RFFT. The input of the RFFT routine consists of N 
real data words whose address locations are initially bit 
reversal sorted. Compared to a traditional FFT, not all 
butterflies have to be computed and the order for choosing 
butterfly coefficients is reconfigured. 
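The bit-reversal sorting of the input addresses can be illustrated with a minimal Python sketch. The function names `bit_reverse` and `bit_reversed_order` are hypothetical and only model the address permutation; they do not describe the hardware implementation.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Address sequence for an n-point transform (n must be a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


# For the 8-point case, addresses 1 and 4 swap, as do 3 and 6.
print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice returns the original order, so the same mapping can be used for loading and unloading data.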
 
The grey arcs show the conjugated complex numbers. After 
the RFFT completion, the output array contains N/2 + 1 real 
and N/2-1 complex values. The real-valued (black) and 
complex butterflies (red/blue) are seen in the 3rd RFFT stage. 
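The conjugate symmetry that the RFFT exploits can be checked numerically. The sketch below uses a direct DFT in plain Python (not the RFFT itself) on arbitrary illustrative sample values; `dft` and `symmetry_error` are hypothetical helper names.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * ((k * t) % n) / n)
                for t in range(n))
            for k in range(n)]


def symmetry_error(x):
    """Largest deviation from X[N-k] == conj(X[k]) for a real input x."""
    n = len(x)
    spectrum = dft(x)
    return max(abs(spectrum[n - k] - spectrum[k].conjugate())
               for k in range(1, n // 2))


samples = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]  # arbitrary real data
spectrum = dft(samples)
print(symmetry_error(samples))                    # close to zero
print(spectrum[0].imag, spectrum[len(samples) // 2].imag)  # both close to zero
```

Bins 0 and N/2 are purely real and bins above N/2 mirror those below, so N stored words are enough to describe the full N-point spectrum.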
Circuits and Systems Research Centre, Department of Electronic and Computer Engineering, 
University of Limerick, Limerick, Ireland. 
Fig. 1. Signal flow graph for 8-point RFFT 
Red lines denote multiplication by the cosine factor, blue lines 
multiplication by the sine factor. The grey dashed lines show 
the corresponding real and imaginary parts for complex 
numbers, calculated according to the twiddle factor $W_N^r$ in (1), 
where A is the rotation angle and 𝑟 is sample index relative to 




$$W_N^r = \cos(A) - j\,\sin(A), \quad \text{where } A = \frac{2\pi r}{N}. \tag{1}$$
As floating-point processing is more area intensive on-chip, 
the proposed CPU uses a fixed-point architecture for ease of 
implementation and reduced hardware. However, in 
fixed-point implementations, care is required to prevent 
overflows. During the RFFT, each butterfly performs 
arithmetic operations on two n-bit data words. To avoid 
overflow, the RFFT is modified so that each data word is 
divided by 2 prior to entering the butterfly, which 
prevents the butterfly outputs from exceeding the maximum 
bit width. To ensure that accuracy is maintained, the ADC 
outputs are pre-scaled to the maximum value allowable before 
being stored in memory. Higher radix algorithms, such 
as radix-4, radix-8 and split-radix techniques, improve speed 
compared to the radix-2 FFT; however, they result in more 
complex butterfly structures with longer program code 
and more address computations. The aim of this work is to 
produce a processor design suitable for low-area, easy 
integration into SoCs, and so the radix-2 RFFT is selected. 
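The effect of halving each data word before the butterfly can be sketched in Python. This is an illustration under simplifying assumptions: it uses a generic complex radix-2 decimation-in-time FFT in floating point rather than the real-valued fixed-point algorithm of the processor, and all function names are hypothetical. It shows that the largest intermediate magnitude never exceeds the largest input magnitude, and that the result equals the DFT divided by N.

```python
import cmath
import math


def bit_reverse(index, bits):
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def scaled_fft(x):
    """Radix-2 DIT FFT with both butterfly inputs halved at every stage.

    Returns (spectrum, peak): spectrum equals DFT(x) / len(x), and peak is
    the largest magnitude seen in the data array at any stage.
    """
    n = len(x)
    bits = n.bit_length() - 1
    data = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    peak = max(abs(v) for v in data)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for j in range(half):
                w = cmath.exp(-2j * math.pi * j / (2 * half))
                a = data[start + j] / 2               # pre-scaled input
                b = w * data[start + j + half] / 2    # pre-scaled, rotated input
                data[start + j] = a + b
                data[start + j + half] = a - b
        peak = max(peak, max(abs(v) for v in data))
        half *= 2
    return data, peak


def reference_dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * ((k * t) % n) / n)
                for t in range(n))
            for k in range(n)]


samples = [1.0, -1.0, 0.5, 1.0, -0.25, 0.75, -1.0, 1.0]  # full scale is 1.0
spectrum, peak = scaled_fft(samples)
print(peak)  # does not exceed the largest input magnitude
```

Because log2(N) stages each halve the data, the output is scaled by 1/N overall, which is why pre-scaling the ADC samples to full scale helps preserve accuracy.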
 
The on-chip BIST computational accuracy needs to be 
comparable with off-chip automated test equipment (ATE). 
An analysis of the RFFT technique for ADC spectral testing 
with fixed-point implementation was used to study the CPU 
datapath bit width needs, resulting in the plot shown in Fig. 2. 
The data shows that in order to spectrally test ADC resolutions 
from 8 to 14 bits with variable FFT sizes up to 32K points, a 
datapath of at least 30 bits is necessary to achieve accuracy for 
signal monitoring and safety-critical applications. The plot 
shows that lower ADC resolutions (< 14 bits) can make use of 
shorter datapath lengths, resulting in less logic; however, it is 
notable that the accuracy of results for bit widths below 
22 bits is insufficient. Most CPU architectures use 
multiples of 8-bits for datapath and so the bit width for this 

















Fig. 2. Datapath bit width analysis based on SINAD needs (x-axis: CPU datapath bit width)
III. PROCESSOR IMPLEMENTATION 
The processor unit is designed for easy IC integration and is 
capable of reusing SoC main memory. As multiple memory 
instances are not guaranteed to be available for specific 
applications, the CPU architecture uses single-port SRAM and 
memory access is supported by complex instructions such as 
multiplication and CORDIC [7] functions. The CPU 
prioritizes small hardware size over execution speed, 
and so the multiply and CORDIC functions use serial Booth 
operations. The instruction set is kept simple for easy 
instruction decoding and user programmability. The CPU 
supports 29 instructions, categorized into 8 data move, 11 
arithmetic, 9 branch and 1 control instruction. 
Furthermore, c-code assembler and cycle accurate simulator 






























Fig. 3. CPU datapath unit 
For the RFFT algorithm, direct and indirect address 
support via the P register is essential as large amounts of data 
from multiple memory locations are processed within the 
software FFT loops. The memory interface also features 
hardwired address bit-reversal alleviating remapping needs. 
The program counter (PC) holds the physical address of the 
executed instruction in memory, while the instruction decoder 
is responsible for decoding the current instruction and setting 
the control lines for all other modules. The datapath unit also 
has CORDIC functionality to generate the sine and cosine 
twiddle factors needed during RFFT butterfly computation. In 
order to perform the CORDIC operation, two barrel shifters 
shift right by multiple bit positions within a single cycle. 
  
Both ALU1 and ALU2 perform add/subtract operations, 
allowing the computation of one CORDIC iteration per cycle 
according to equations (2)-(4): 
$$x_{i+1} = x_i - d_i \, y_i \, 2^{-i}, \tag{2}$$
$$y_{i+1} = y_i + d_i \, x_i \, 2^{-i}, \tag{3}$$
$$z_{i+1} = z_i - d_i \tan^{-1}(2^{-i}). \tag{4}$$
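A minimal Python sketch of the CORDIC iteration in rotation mode follows. It is written in floating point for readability, whereas hardware would use fixed-point shifts and adds. The 32 iterations, the gain pre-compensation constant `GAIN` and the function name `cordic_sin_cos` are assumptions made for illustration.

```python
import math

ITERATIONS = 32  # assumed iteration count for this sketch
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Each iteration stretches the vector by sqrt(1 + 2^-2i); the product is the
# CORDIC gain, compensated here by starting from x = 1 / GAIN.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(angle):
    """Rotation mode: drive z to zero, returning (sin(angle), cos(angle))."""
    x, y, z = 1.0 / GAIN, 0.0, angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0          # d_i = sign of residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return y, x


s, c = cordic_sin_cos(0.6)
print(s, math.sin(0.6))
print(c, math.cos(0.6))
```

In rotation mode the sign d_i follows the residual angle z_i, so only shifts, adds and a small arctangent table are needed to produce a twiddle factor.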
 The x, y and z variables are then represented by the A, B 
and T registers in the datapath unit. The CORDIC table 
contains 32 constant $\tan^{-1}(2^{-i})$ values in a read-only 
register file, removing the need for increased ROM storage of 
precomputed cosine and sine values. Based on $d_i = \pm 1$, the 
instruction decode for the CORDIC unit enables CORDIC 
rotational mode for generating sine/cosine twiddle factors and 
vectoring mode for absolute values used in the computation of 







another advantage of combining the CORDIC and RFFT 
technique. In the innermost RFFT loop that computes the 
rotation angle A every cycle, A is set to E = 2π/N1 and 
incremented by E after every sine/cosine computation. The 
innermost loop is run N4-1 times, and N4 reaches a maximum 
N/4max. With N4=N1/4, equation (5) proves that the argument 
for computation of cosine/sine values stays within the 
CORDIC’s range of convergence. 
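The convergence claim can be checked numerically with a short Python sketch. It assumes the reading given above (the angle starts at E = 2π/N1 and grows by E for N4-1 iterations, with N4 = N1/4) and a 32-entry arctangent table; it is a numeric illustration, not a reproduction of the paper's equation. The names `max_rotation_angle` and `CORDIC_RANGE` are hypothetical.

```python
import math


def max_rotation_angle(n1):
    """Largest angle reached in the innermost loop for sub-transform size n1."""
    e = 2.0 * math.pi / n1   # angle increment E
    n4 = n1 // 4             # N4 = N1 / 4
    return (n4 - 1) * e      # equals pi/2 - 2*pi/n1


# Rotation-mode CORDIC converges for |angle| up to the sum of the table entries.
CORDIC_RANGE = sum(math.atan(2.0 ** -i) for i in range(32))

for n1 in (8, 1024, 32768):
    print(n1, max_rotation_angle(n1), CORDIC_RANGE)
```

Under these assumptions the largest angle stays below π/2 for every N1, while the convergence range is roughly 1.74 rad, so the argument always remains inside it.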
 
The main advantage of the CORDIC algorithm in the 
proposed CPU architecture is that no large memory table is 
required to store FFT coefficients. A further benefit is the fast 
computation of the absolute value of a complex number which 
is performed after the FFT is completed.  
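The absolute-value computation can be sketched with CORDIC in vectoring mode, using the same iteration as rotation mode but choosing the direction so that the y component is driven to zero. The floating-point arithmetic, the half-plane pre-rotation and the name `cordic_magnitude` are assumptions for illustration, not details taken from the processor.

```python
import math

ITERATIONS = 32  # assumed iteration count for this sketch
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_magnitude(re, im):
    """Vectoring mode: rotate (re, im) onto the x axis and return its length."""
    # Mirror into the right half-plane so the iteration converges;
    # the magnitude is unchanged by this step.
    x, y = (re, im) if re >= 0.0 else (-re, -im)
    for i in range(ITERATIONS):
        d = -1.0 if y >= 0.0 else 1.0          # d_i chosen to shrink |y|
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x / GAIN                             # remove the CORDIC gain


print(cordic_magnitude(3.0, 4.0))    # approximately 5
print(cordic_magnitude(-3.0, 4.0))   # approximately 5
```

No multiplier or square-root unit is needed, which is consistent with the low-area aim described above.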
 
In order to evaluate CPU accuracy, both an ideal full-scale 
sinewave and a non-ideal full-scale sinewave containing distortion 
are fed into a 10-bit ADC. The quantized signal is loaded into memory and 
processor code computes an 8K-pt RFFT, absolute values and 
final parameter extraction. Table I shows that the parameters 
computed by the CPU (fixed-point) closely match those 
computed by MATLAB (double precision). A negligible 
amount of precision is lost due to the iterative CORDIC 
generation of the twiddle factors in the butterfly computation. 
Figure 4 plots the error difference for the CPU with and 
without CORDIC compared to a double precision FFT 
implementation (MATLAB). The error difference is smaller 
than 8×10^-6, demonstrating the CPU's applicability for in-situ 
dynamic ADC test operations. A CPU without CORDIC is 
more precise; however, dedicated cosine/sine table storage in 
memory adds logic and power overhead. 
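The parameter extraction step can be illustrated with a small Python sketch that derives SINAD, SFDR and THD from the spectrum of a coherently sampled, ideally quantized sinewave. It is a sketch under assumptions: a 512-point record with 67 cycles, a slow direct DFT in double precision, five harmonics in the THD sum and commonly used definitions of the three parameters. None of these choices is taken from the processor code, and the function names are hypothetical.

```python
import cmath
import math


def single_sided_power(x):
    """Power in bins 0..N/2 of a real record, via a direct (slow) DFT."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * ((k * t) % n) / n)
                  for t in range(n))
        scale = 1.0 if k in (0, n // 2) else 2.0   # fold in the mirrored bin
        power.append(scale * abs(acc) ** 2)
    return power


def dynamic_parameters(x, signal_bin, harmonics=5):
    """Return SINAD, SFDR and THD in dB for a coherently sampled record."""
    n = len(x)
    power = single_sided_power(x)
    power[0] = 0.0                                  # ignore DC
    signal = power[signal_bin]
    noise_and_distortion = sum(power) - signal
    harmonic_bins = set()
    for h in range(2, harmonics + 2):
        k = (h * signal_bin) % n
        k = n - k if k > n // 2 else k              # alias back into 0..N/2
        if k not in (0, signal_bin):
            harmonic_bins.add(k)
    harmonic_power = sum(power[k] for k in harmonic_bins)
    largest_spur = max(p for k, p in enumerate(power) if k != signal_bin)
    return {
        "SINAD": 10.0 * math.log10(signal / noise_and_distortion),
        "SFDR": 10.0 * math.log10(signal / largest_spur),
        "THD": 10.0 * math.log10(harmonic_power / signal),
    }


N, CYCLES, BITS = 512, 67, 10                       # illustrative values only
amplitude = 2 ** (BITS - 1) - 1
record = [round(amplitude * math.sin(2.0 * math.pi * CYCLES * t / N))
          for t in range(N)]
params = dynamic_parameters(record, CYCLES)
print(params)
```

For an ideal 10-bit quantizer the SINAD is expected to land near the textbook figure of 6.02·10 + 1.76 dB; the exact value depends on the record length and the number of cycles chosen.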
 
TABLE I: Comparison of computing dynamic parameters 
Ideal Signal                      THD           SFDR         SINAD
CPU Implementation                -92.085 dB    83.340 dB    61.957 dB
MATLAB, Double Precision          -92.088 dB    83.343 dB    61.962 dB

Signal with Noise + Distortion    THD           SFDR         SINAD
CPU Implementation                -58.711 dB    60.185 dB    55.730 dB
MATLAB, Double Precision          -58.711 dB    60.185 dB    55.733 dB
Fig. 4. Comparison of CPU after FFT absolute calculations 
IV. TEST CHIP & RESULTS 
The CPU was designed using Verilog RTL and integrated 
into an overall ADC BIST SoC design. The design consists of 
a BIST manager unit supporting ADC data acquisition and an 
on-chip serial interface to communicate externally with a PC. 
The CPU core is connected to a single port SRAM that is 
40,936*32-bits to support variable FFT record sizes up to 
32,768-points. The remaining memory is available for general 
purpose program code. The full design was implemented in 
UMC 180nm CMOS technology with a 44-pin CLCC package 
and verified on a test board integrating 8~14-bit ADCs. The 
plot for the chip and test board design is shown in Fig. 5. The 
CPU core equates to 11,750 ND2 gates and has an area of 
0.142 mm². The memory unit contains 4 banks of 
10,234×32-bit SRAM cells (2 mm² each) that surround the CPU 







   
                     (a)                                                   (b) 
Fig. 5. (a) Fabricated IC and (b) Prototype board design 
 The chip design operates off 3.3V IO and 1.8V core supplies 
with power consumption measured at 150mW operating from 
a 100MHz clock. The large memories consume most power, 
with the standalone CPU core dissipating 4.3mW of power 
during ADC test execution. The design is highly portable to 
advanced CMOS technology nodes due to its synchronous 
clock operation and synthesizable RTL code features. This is 
particularly advantageous in denser nanometer ICs as test time 
can be improved significantly by running the CPU faster at the 
cost of consuming more power. The program code for variable 
FFT records up to 32K-points and dynamic parameter 
extraction is contained within 670 op-codes - this code is 
downloaded to the CPU memory via a serial interface. The test 
time duration excluding data acquisition is a function of the 
number of computation clock cycles it takes to perform the 
RFFT, absolute and parameter extraction phases.  


























Table II gives the clock-cycle breakdown for an 8192-point 
FFT applied to a 12-bit ADC. A significant time is taken up by 
the number of memory accesses that occur during the FFT 
butterfly stages, which interrupts pipelining sequence 
capabilities. During butterfly operation, the indirect address 
modes could be improved using multi-port SRAM or cache 
access to minimize cycle time, however this would be at the 
cost of extra hardware logic. Table II also shows the test time 
duration for FFT sizes from 2K~32K-point. 
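The relation between clock cycles and test time can be made concrete with a few lines of Python, using the 8K-point cycle counts reported in Table II and the 100MHz clock. The helper name `compute_time_ms` is hypothetical.

```python
CLOCK_HZ = 100e6  # clock frequency used for the reported measurements

# Cycle counts for the 8K-point case, as reported in Table II.
cycles_8k = {"FFT": 8.43e6, "ABS": 0.65e6, "Param.": 0.75e6}


def compute_time_ms(cycle_counts, clock_hz):
    """Computation time in milliseconds, excluding data acquisition."""
    return sum(cycle_counts.values()) / clock_hz * 1e3


total_cycles = sum(cycles_8k.values())
time_ms = compute_time_ms(cycles_8k, CLOCK_HZ)
print(total_cycles, time_ms)  # about 9.83 million cycles, about 98.3 ms
```

The rounded cycle counts give about 98.3 ms, in line with the reported test time for the 8K-point case. Since the time is simply cycles divided by clock frequency, a faster clock shortens the test proportionally.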
TABLE II: Clock cycle breakdown and computation time @100MHz 
 
Table III compares relevant characteristics of other 
memory-based FFT processors calculating 1024-point FFTs. It 
is difficult to achieve a like for like comparison as FFT 
processors vary according to their application needs, but in [8, 
9], normalized area, power per butterfly and FFTs per energy 
are useful as metrics for comparison. In this case, normalized 
area is the silicon area normalized to 90nm technology 




The adjusted FFTs per Joule metric compares the number of FFTs 
calculated per unit energy, scaled according to FFT size in (7), and 
power consumption per butterfly records compares power 





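$$\text{Adjusted FFTs per Joule} = \frac{\text{Technology} \cdot \text{DataWidth} \cdot N \log_r(N)}{\text{Power} \cdot \text{Exec time} \cdot 10^{-6}}, \tag{7}$$
 
$$\text{Power per butterfly operation} = \frac{\text{Power}}{\text{Number of Butterflies}}. \tag{8}$$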
 
TABLE III: Mem-based processor comparison for 1024-point FFTs 
                        Zhao[10]    Lin[11]     Huang[12]    This work
Technology              0.18µm      0.18µm      90nm         0.18µm
Word-length (bits)      16          22          16           32
Clock rate              20 MHz      20 MHz      160 MHz      100 MHz
FFT algorithm           Radix-2     Radix-8     Radix-16     Radix-2
FFT size                1K          8K          1K~32K       1K~32K
Execution Time (µs)     512         134         6.575        7,792
Core Area (mm²)         2.68        2.0167      0.88         0.382
Power (mW)              81.8        25.2        29           11.3
FFTs/Joule              704.16      4002.80     19333.42     669.88
Normalized Area         0.67        0.504       0.88         0.096
Power/Butterfly (mW)    0.401       0.059       0.181        0.0038
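Two of the Table III metrics can be recomputed with a short Python sketch. Two assumptions are made and should be treated as such: normalized area is taken as core area scaled by the square of the feature-size ratio to 90nm, and (7) is evaluated with technology in µm, power in mW and execution time in µs, then divided by 1000 to match the scale of the FFTs/Joule row. Both assumptions reproduce the tabulated values for the rows checked here, but neither is confirmed by the text. Function and variable names are hypothetical.

```python
import math

N = 1024  # all entries are compared for 1024-point FFTs

# name: (technology_um, word_bits, radix, exec_time_us, core_area_mm2, power_mw)
PROCESSORS = {
    "Zhao[10]":  (0.18, 16, 2, 512.0, 2.68, 81.8),
    "Lin[11]":   (0.18, 22, 8, 134.0, 2.0167, 25.2),
    "Huang[12]": (0.09, 16, 16, 6.575, 0.88, 29.0),
    "This work": (0.18, 32, 2, 7792.0, 0.382, 11.3),
}


def adjusted_ffts_per_joule(tech_um, bits, radix, time_us, power_mw, n=N):
    """Expression (7) with the assumed units (um, mW, us)."""
    log_r_n = math.log2(n) / math.log2(radix)
    return tech_um * bits * n * log_r_n / (power_mw * time_us * 1e-6)


def normalized_area(area_mm2, tech_um, ref_um=0.09):
    """Assumed quadratic scaling of area with feature size."""
    return area_mm2 * (ref_um / tech_um) ** 2


results = {}
for name, (tech, bits, radix, t_us, area, p_mw) in PROCESSORS.items():
    fj = adjusted_ffts_per_joule(tech, bits, radix, t_us, p_mw) / 1000.0
    results[name] = (fj, normalized_area(area, tech))
    print(f"{name:10s} FFTs/Joule ~ {fj:10.2f}   normalized area ~ {results[name][1]:.3f}")
```

With these assumptions the computed figures agree closely with the tabulated FFTs/Joule and Normalized Area rows.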
 
The proposed CPU has the best normalized area metric, 
supporting the low area easy integration needs for on-chip test. 
The power per butterfly operation is also notable since a low-
logic butterfly operation without the need for complex 
multiplier(s) is supported. The long execution time, due to the 
Booth-serial multiply and CORDIC operations, lowers the FFTs per 
energy result. On the other hand, this design supports a very high 
SQNR (96dB) to enable accurate spectral test capability for a 
wide variety of ADC resolutions and is highly scalable to 
low-voltage advanced nm process nodes. The CPU not only 
performs FFTs, but also extracts accurate dynamic parameters 
from the frequency spectrum using the CORDIC unit and 
general purpose programming supported by the opcodes. This 
design is particularly useful for safety-critical IC applications 
where in-line testing of ADCs can be carried out during non-
functional periods at startup and during interval down-times, 
ensuring greater product reliability.  
V.  CONCLUSIONS 
This paper presented the analysis and design of a processor 
that executes variable length Real-Value FFTs. The 
architecture is easily implementable into SoCs and is suitable 
for in-situ applications where highly accurate ADC spectral 
analysis can occur when test time duration is not the major 
concern. The CPU architecture has a very low logic area and can 
be reused for other non-test processing applications. A 0.18µm 
CMOS chip and test-board implementation validates the ADC 
dynamic measurement capabilities.  
VI. ACKNOWLEDGMENT 
The authors would like to thank Thomas Flesichmann for his 
contribution to the early development of this work. This 
project is funded by Enterprise-Ireland Commercialization 
Fund CF-16-0380P.  
REFERENCES 
[1] D. E. Bellasi, L. Bettini, C. Benkeser, T. Burger, Q. Huang, and C. 
Studer, "VLSI Design of a Monolithic Compressive-Sensing 
Wideband Analog-to-Information Converter," IEEE Journal on 
Emerging and Selected Topics in Circuits and Systems, vol. 3, pp. 
552-565, 2013. 
[2] M. Ayinala, Y. Lao, and K. K. Parhi, "An In-Place FFT 
Architecture for Real-Valued Signals," IEEE Transactions on 
Circuits and Systems II: Express Briefs, vol. 60, pp. 652-656, 2013. 
[3] M. Mahoney, DSP-Based Testing of Analog and Mixed-Signal 
Circuits: IEEE Computer Society Press, 1987. 
[4] J. Duan and D. Chen, "SNR measurement based on linearity test 
for ADC BIST," in IEEE International Symposium of Circuits and 
Systems (ISCAS), 2011, pp. 269-272. 
[5] Y. Duan, T. Chen, and D. Chen, "Low-cost Dithering Generator for 
Accurate ADC Linearity Test," presented at the IEEE International 
Symposium on Circuits and Systems (ISCAS), Montreal, 2016. 
[6] H. Sorensen, D. Jones, M. Heideman, and C. Burrus, "Real-valued 
fast Fourier transform algorithms," IEEE Transactions on 
Acoustics, Speech, and Signal Processing, vol. 35, pp. 849-863, 
1987. 
[7] J. E. Volder, "The CORDIC trigonometric computing technique," 
IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 
330-334, Sept. 1959. 
[8] B. M. Baas, "A low-power, high-performance, 1024-point FFT 
processor," in IEEE Journal of Solid-State Circuits, vol. 34, no. 3, 
pp. 380-387, Mar. 1999. 
[9] S. Dirlik, "A comparison of FFT processor designs," Computer 
Architecture for Embedded Systems, Department of EEMCS, 
University of Twente, 2013. 
[10] Y. Zhao, A. T. Erdogan, and T. Arslan, "A low-power and domain-
specific reconfigurable FFT fabric for system-on-chip 
applications," in 19th IEEE International Parallel and Distributed 
Processing Symposium, 2005. 
[11] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, "A dynamic scaling FFT 
processor for DVB-T applications," in IEEE Journal of Solid-State 
Circuits, vol. 39, no. 11, pp. 2005-2013, Nov. 2004. 
[12] S. J. Huang and S. G. Chen, "A high-parallelism memory-based 
FFT processor with high SQNR and novel addressing scheme," 
presented at the IEEE International Symposium on Circuits and 
Systems (ISCAS), Montreal, QC, 2016. 
8-K FFT           FFT      ABS      Param.    Total    Time
Clock cycles      8.43M    0.65M    0.75M     9.83M    98.4ms

FFT-length        2-K      4-K      8-K       16-K      32-K
Test time (ms)    21.19    45.66    98.38     212.69    455.06
